.. SPDX-License-Identifier: GPL-2.0

=====================================
Scaling in the Linux Networking Stack
=====================================


Introduction
============

This document describes a set of complementary techniques in the Linux
networking stack to increase parallelism and improve performance for
multi-processor systems.

The following technologies are described:

- RSS: Receive Side Scaling
- RPS: Receive Packet Steering
- RFS: Receive Flow Steering
- Accelerated Receive Flow Steering
- XPS: Transmit Packet Steering


RSS: Receive Side Scaling
=========================

Contemporary NICs support multiple receive and transmit descriptor queues
(multi-queue). On reception, a NIC can send different packets to different
queues to distribute processing among CPUs. The NIC distributes packets by
applying a filter to each packet that assigns it to one of a small number
of logical flows. Packets for each flow are steered to a separate receive
queue, which in turn can be processed by separate CPUs. This mechanism is
generally known as “Receive-side Scaling” (RSS). The goal of RSS and
the other scaling techniques is to increase performance uniformly.
Multi-queue distribution can also be used for traffic prioritization, but
that is not the focus of these techniques.

The filter used in RSS is typically a hash function over the network
and/or transport layer headers, for example a 4-tuple hash over the
IP addresses and TCP ports of a packet. The most common hardware
implementation of RSS uses a 128-entry indirection table where each entry
stores a queue number. The receive queue for a packet is determined
by masking the computed hash for the packet (usually a Toeplitz hash)
down to its low-order seven bits, taking this number as a key into the
indirection table and reading the corresponding value.

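The indirection-table lookup described above can be sketched in a few
lines of Python. This is a minimal model, not driver or NIC code: the
names ``RETA_SIZE``, ``build_default_table`` and ``select_rx_queue`` are
assumptions made for this example, and the hash value is an arbitrary
constant rather than a real Toeplitz hash.

.. code-block:: python

  RETA_SIZE = 128  # table entries, matching the common case described above


  def build_default_table(num_queues, size=RETA_SIZE):
      """Spread the queues evenly over the table (round-robin)."""
      return [i % num_queues for i in range(size)]


  def select_rx_queue(flow_hash, table):
      """Use the low-order bits of the hash as the key into the table."""
      index = flow_hash & (len(table) - 1)  # low 7 bits for 128 entries
      return table[index]


  table = build_default_table(num_queues=4)
  queue = select_rx_queue(0x9E3779B9, table)  # arbitrary example hash

The mask is only a valid index when the table size is a power of two,
which holds for the 128-entry table mentioned above.
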
Some advanced NICs allow steering packets to queues based on
programmable filters. For example, TCP port 80 packets bound for a web
server can be directed to their own receive queue. Such “n-tuple” filters
can be configured from ethtool (--config-ntuple).


RSS Configuration
-----------------

The driver for a multi-queue capable NIC typically provides a kernel
module parameter for specifying the number of hardware queues to
configure. In the bnx2x driver, for instance, this parameter is called
num_queues. A typical RSS configuration would be to have one receive queue
for each CPU if the device supports enough queues, or otherwise at least
one for each memory domain, where a memory domain is a set of CPUs that
share a particular memory level (L1, L2, NUMA node, etc.).

The indirection table of an RSS device, which resolves a queue by masked
hash, is usually programmed by the driver at initialization. The
default mapping is to distribute the queues evenly in the table, but the
indirection table can be retrieved and modified at runtime using ethtool
commands (--show-rxfh-indir and --set-rxfh-indir). Modifying the
indirection table could be done to give different queues different
relative weights.

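As a rough illustration of relative weights, the sketch below fills an
indirection table so that each queue's share of entries follows a list of
weights. The function name and the contiguous layout are assumptions made
for this example; a real driver or ethtool may arrange the entries
differently.

.. code-block:: python

  def build_weighted_table(weights, size=128):
      """Return a table whose per-queue entry counts follow the weights."""
      total = sum(weights)
      if total <= 0:
          raise ValueError("at least one queue needs a non-zero weight")
      table = []
      cumulative = 0
      for queue, weight in enumerate(weights):
          cumulative += weight
          # Fill entries up to this queue's cumulative share of the table.
          upper = size * cumulative // total
          table.extend([queue] * (upper - len(table)))
      return table


  # Queue 0 receives three times as many table entries as queue 1.
  weighted = build_weighted_table([3, 1])

Because the hash is assumed to be uniform over the table indices, a queue
that owns three quarters of the entries receives roughly three quarters of
the flows.
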

RSS IRQ Configuration
~~~~~~~~~~~~~~~~~~~~~

Each receive queue has a separate IRQ associated with it. The NIC triggers
this to notify a CPU when new packets arrive on the given queue. The
signaling path for PCIe devices uses message signaled interrupts (MSI-X),
which can route each interrupt to a particular CPU. The active mapping
of queues to IRQs can be determined from /proc/interrupts. By default,
an IRQ may be handled on any CPU. Because a non-negligible part of packet
processing takes place in receive interrupt handling, it is advantageous
to spread receive interrupts between CPUs. To manually adjust the IRQ
affinity of each interrupt, see Documentation/core-api/irq/irq-affinity.rst.
Some systems will be running irqbalance, a daemon that dynamically
optimizes IRQ assignments and as a result may override any manual settings.


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

RSS should be enabled when latency is a concern or whenever receive
interrupt processing forms a bottleneck. Spreading load between CPUs
decreases queue length. For low latency networking, the optimal setting
is to allocate as many queues as there are CPUs in the system (or the
NIC maximum, if lower). The most efficient high-rate configuration
is likely the one with the smallest number of receive queues where no
receive queue overflows due to a saturated CPU, because in default
mode with interrupt coalescing enabled, the aggregate number of
interrupts (and thus work) grows with each additional queue.

Per-CPU load can be observed using the mpstat utility, but note that on
processors with hyperthreading (HT), each hyperthread is represented as
a separate CPU. For interrupt handling, HT has shown no benefit in
initial tests, so limit the number of queues to the number of CPU cores
in the system.


RPS: Receive Packet Steering
============================

Receive Packet Steering (RPS) is logically a software implementation of
RSS. Being in software, it is necessarily called later in the datapath.
Whereas RSS selects the queue and hence CPU that will run the hardware
interrupt handler, RPS selects the CPU to perform protocol processing
above the interrupt handler. This is accomplished by placing the packet
on the desired CPU’s backlog queue and waking up the CPU for processing.
RPS has some advantages over RSS:

1) it can be used with any NIC
2) software filters can easily be added to hash over new protocols
3) it does not increase hardware device interrupt rate (although it does
   introduce inter-processor interrupts (IPIs))

RPS is called during the bottom half of the receive interrupt handler,
when a driver sends a packet up the network stack with netif_rx() or
netif_receive_skb(). These call the get_rps_cpu() function, which
selects the CPU (and hence backlog queue) that should process a packet.

The first step in determining the target CPU for RPS is to calculate a
flow hash over the packet’s addresses or ports (2-tuple or 4-tuple hash
depending on the protocol). This serves as a consistent hash of the
associated flow of the packet. The hash is either provided by hardware
or will be computed in the stack. Capable hardware can pass the hash in
the receive descriptor for the packet; this would usually be the same
hash used for RSS (e.g. computed Toeplitz hash). The hash is saved in
skb->hash and can be used elsewhere in the stack as a hash of the
packet’s flow.

Each receive hardware queue has an associated list of CPUs to which
RPS may enqueue packets for processing. For each received packet,
an index into the list is computed from the flow hash modulo the size
of the list. The indexed CPU is the target for processing the packet,
and the packet is queued to the tail of that CPU’s backlog queue. At
the end of the bottom half routine, IPIs are sent to any CPUs for which
packets have been queued to their backlog queue. The IPI wakes backlog
processing on the remote CPU, and any queued packets are then processed
up the networking stack.

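The CPU selection step can be modelled as below. This is a sketch that
follows the description above (hash modulo list size), not the kernel's
get_rps_cpu() code, whose index computation may differ in detail. The
function names and the way the CPU list is passed in as a bitmap are
assumptions made for this example.

.. code-block:: python

  def cpus_from_mask(mask):
      """Expand a CPU bitmap into a sorted list of CPU numbers."""
      return [cpu for cpu in range(mask.bit_length()) if mask & (1 << cpu)]


  def select_rps_cpu(flow_hash, rps_cpus_mask, interrupting_cpu):
      """Pick the CPU that performs protocol processing for a packet."""
      cpus = cpus_from_mask(rps_cpus_mask)
      if not cpus:
          # Empty list: RPS is off, so stay on the interrupting CPU.
          return interrupting_cpu
      return cpus[flow_hash % len(cpus)]


  # CPUs 1-3 are eligible; every packet of a flow maps to the same CPU.
  target = select_rps_cpu(flow_hash=10, rps_cpus_mask=0b1110,
                          interrupting_cpu=0)

Because the index depends only on the flow hash, all packets of one flow
land on the same backlog queue, which is what keeps them in order.
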

RPS Configuration
-----------------

RPS requires a kernel compiled with the CONFIG_RPS kconfig symbol (on
by default for SMP). Even when compiled in, RPS remains disabled until
explicitly configured. The list of CPUs to which RPS may forward traffic
can be configured for each receive queue using a sysfs file entry::

  /sys/class/net/<dev>/queues/rx-<n>/rps_cpus

This file implements a bitmap of CPUs. RPS is disabled when it is zero
(the default), in which case packets are processed on the interrupting
CPU. Documentation/core-api/irq/irq-affinity.rst explains how CPUs are
assigned to the bitmap.

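A small helper can turn a list of CPU numbers into the hexadecimal string
written to such a bitmap file. The sketch assumes the usual kernel cpumask
text format of comma-separated 32-bit hexadecimal groups, most significant
group first; the function names are hypothetical.

.. code-block:: python

  def cpu_list_to_mask(cpus):
      """Set one bit per CPU number."""
      mask = 0
      for cpu in cpus:
          mask |= 1 << cpu
      return mask


  def format_cpumask(mask):
      """Format a bitmap as comma-separated 32-bit hex groups."""
      groups = []
      while True:
          groups.append(mask & 0xFFFFFFFF)
          mask >>= 32
          if not mask:
              break
      groups.reverse()  # most significant group first
      head = f"{groups[0]:x}"
      tail = [f"{group:08x}" for group in groups[1:]]
      return ",".join([head] + tail)


  # CPUs 0-3 give "f"; this string would be written to rps_cpus.
  value = format_cpumask(cpu_list_to_mask([0, 1, 2, 3]))

Writing the value requires root privileges and takes effect per receive
queue, so a multi-queue device needs one write per rx-<n> directory.
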

Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

For a single queue device, a typical RPS configuration would be to set
the rps_cpus to the CPUs in the same memory domain as the interrupting
CPU. If NUMA locality is not an issue, this could also be all CPUs in
the system. At high interrupt rate, it might be wise to exclude the
interrupting CPU from the map since it already performs much work.

For a multi-queue system, if RSS is configured so that a hardware
receive queue is mapped to each CPU, then RPS is probably redundant
and unnecessary. If there are fewer hardware queues than CPUs, then
RPS might be beneficial if the rps_cpus for each queue are the ones that
share the same memory domain as the interrupting CPU for that queue.


RPS Flow Limit
--------------

RPS scales kernel receive processing across CPUs without introducing
reordering. The trade-off to sending all packets from the same flow
to the same CPU is CPU load imbalance if flows vary in packet rate.
In the extreme case a single flow dominates traffic. Especially on
common server workloads with many concurrent connections, such
behavior indicates a problem such as a misconfiguration or spoofed
source Denial of Service attack.

Flow Limit is an optional RPS feature that prioritizes small flows
during CPU contention by dropping packets from large flows slightly
ahead of those from small flows. It is active only when an RPS or RFS
destination CPU approaches saturation. Once a CPU's input packet
queue exceeds half the maximum queue length (as set by sysctl
net.core.netdev_max_backlog), the kernel starts a per-flow packet
count over the last 256 packets. If a flow exceeds a set ratio (by
default, half) of these packets when a new packet arrives, then the
new packet is dropped. Packets from other flows are still only
dropped once the input packet queue reaches netdev_max_backlog.
No packets are dropped when the input packet queue length is below
the threshold, so flow limit does not sever connections outright:
even large flows maintain connectivity.

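The drop decision can be modelled roughly as follows. This is a
simplified sketch of the behaviour described above, not the kernel
implementation: the class and method names are hypothetical, the history
is kept as a ring buffer of bucket indices, and the defaults (256-packet
history, 4096 buckets, backlog of 1000) are taken from this document.

.. code-block:: python

  HISTORY_LEN = 256  # packets of history tracked per CPU


  class FlowLimit:
      """Per-CPU flow limit state (illustrative model)."""

      def __init__(self, num_buckets=4096, max_backlog=1000):
          self.num_buckets = num_buckets
          self.max_backlog = max_backlog
          self.history = [0] * HISTORY_LEN  # ring of recent bucket indices
          self.head = 0
          self.seen = 0
          self.buckets = [0] * num_buckets  # packets per bucket in history

      def should_drop(self, flow_hash, queue_len):
          # Below half the maximum backlog nothing is counted or dropped.
          if queue_len <= self.max_backlog // 2:
              return False
          new_flow = flow_hash % self.num_buckets
          if self.seen >= HISTORY_LEN:
              # The ring is full: forget the oldest packet.
              self.buckets[self.history[self.head]] -= 1
          else:
              self.seen += 1
          self.history[self.head] = new_flow
          self.head = (self.head + 1) % HISTORY_LEN
          self.buckets[new_flow] += 1
          # Drop once one flow makes up more than half of the history.
          return self.buckets[new_flow] > HISTORY_LEN // 2


  limit = FlowLimit()
  drops = [limit.should_drop(flow_hash=7, queue_len=600) for _ in range(129)]

In this model a flow that fills more than half of the history window
starts losing packets, while a packet from any other flow is still
accepted.
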

Interface
~~~~~~~~~

Flow limit is compiled in by default (CONFIG_NET_FLOW_LIMIT), but not
turned on. It is implemented for each CPU independently (to avoid lock
and cache contention) and toggled per CPU by setting the relevant bit
in sysctl net.core.flow_limit_cpu_bitmap. It exposes the same CPU
bitmap interface as rps_cpus (see above) when accessed through procfs::

  /proc/sys/net/core/flow_limit_cpu_bitmap

Per-flow rate is calculated by hashing each packet into a hashtable
bucket and incrementing a per-bucket counter. The hash function is
the same one that selects a CPU in RPS, but as the number of buckets can
be much larger than the number of CPUs, flow limit has finer-grained
identification of large flows and fewer false positives. The default
table has 4096 buckets. This value can be modified through sysctl::

  net.core.flow_limit_table_len

The value is only consulted when a new table is allocated. Modifying
it does not update active tables.


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

Flow limit is useful on systems with many concurrent connections,
where a single connection taking up 50% of a CPU indicates a problem.
In such environments, enable the feature on all CPUs that handle
network rx interrupts (as set in /proc/irq/N/smp_affinity).

The feature requires the input packet queue length to exceed
the flow limit threshold (50%) + the flow history length (256).
Setting net.core.netdev_max_backlog to either 1000 or 10000
performed well in experiments.


RFS: Receive Flow Steering
==========================

While RPS steers packets solely based on hash, and thus generally
provides good load distribution, it does not take into account
application locality. This is accomplished by Receive Flow Steering
(RFS). The goal of RFS is to increase the data cache hit rate by steering
kernel processing of packets to the CPU where the application thread
consuming the packet is running. RFS relies on the same RPS mechanisms
to enqueue packets onto the backlog of another CPU and to wake up that
CPU.

In RFS, packets are not forwarded directly by the value of their hash,
but the hash is used as an index into a flow lookup table. This table maps
flows to the CPUs where those flows are being processed. The flow hash
(see RPS section above) is used to calculate the index into this table.
The CPU recorded in each entry is the one which last processed the flow.
If an entry does not hold a valid CPU, then packets mapped to that entry
are steered using plain RPS. Multiple table entries may point to the
same CPU. Indeed, with many flows and few CPUs, it is very likely that
a single application thread handles flows with many different flow hashes.

rps_sock_flow_table is a global flow table that contains the *desired* CPU
for flows: the CPU that is currently processing the flow in userspace.
Each table value is a CPU index that is updated during calls to recvmsg
and sendmsg (specifically, inet_recvmsg(), inet_sendmsg(), inet_sendpage()
and tcp_splice_read()).

When the scheduler moves a thread to a new CPU while it has outstanding
receive packets on the old CPU, packets may arrive out of order. To
avoid this, RFS uses a second flow table to track outstanding packets
for each flow: rps_dev_flow_table is a table specific to each hardware
receive queue of each device. Each table value stores a CPU index and a
counter. The CPU index represents the *current* CPU onto which packets
for this flow are enqueued for further kernel processing. Ideally, kernel
and userspace processing occur on the same CPU, and hence the CPU index
in both tables is identical. This is likely false if the scheduler has
recently migrated a userspace thread while the kernel still has packets
enqueued for kernel processing on the old CPU.

The counter in rps_dev_flow_table values records the length of the current
CPU's backlog when a packet in this flow was last enqueued. Each backlog
queue has a head counter that is incremented on dequeue. A tail counter
is computed as head counter + queue length. In other words, the counter
in rps_dev_flow[i] records the last element in flow i that has
been enqueued onto the currently designated CPU for flow i (of course,
entry i is actually selected by hash and multiple flows may hash to the
same entry i).

And now the trick for avoiding out of order packets: when selecting the
CPU for packet processing (from get_rps_cpu()) the rps_sock_flow table
and the rps_dev_flow table of the queue that the packet was received on
are compared. If the desired CPU for the flow (found in the
rps_sock_flow table) matches the current CPU (found in the rps_dev_flow
table), the packet is enqueued onto that CPU’s backlog. If they differ,
the current CPU is updated to match the desired CPU if one of the
following is true:

  - The current CPU's queue head counter >= the recorded tail counter
    value in rps_dev_flow[i]
  - The current CPU is unset (>= nr_cpu_ids)
  - The current CPU is offline

After this check, the packet is sent to the (possibly updated) current
CPU. These rules aim to ensure that a flow only moves to a new CPU when
there are no packets outstanding on the old CPU, as the outstanding
packets could arrive later than those about to be processed on the new
CPU.

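The three rules above can be expressed as a small model. This is a sketch
under stated assumptions rather than the kernel code: ``DevFlow``,
``Backlog`` and ``rfs_select_cpu`` are hypothetical names, an unset CPU is
represented by ``None``, and the desired CPU is assumed to be valid (the
fallback to plain RPS is left out).

.. code-block:: python

  from dataclasses import dataclass
  from typing import Optional


  @dataclass
  class DevFlow:
      """One per-queue flow entry: current CPU and recorded tail counter."""
      cpu: Optional[int] = None
      last_qtail: int = 0


  @dataclass
  class Backlog:
      """Per-CPU backlog whose head counter advances on dequeue."""
      head: int = 0
      length: int = 0
      online: bool = True

      def enqueue(self):
          self.length += 1
          return self.head + self.length  # tail counter after enqueue

      def dequeue(self):
          self.head += 1
          self.length -= 1


  def rfs_select_cpu(entry, desired_cpu, backlogs):
      """Return the CPU to enqueue on; move the flow only when safe."""
      current = entry.cpu
      if desired_cpu != current:
          unset = current is None or current >= len(backlogs)
          if (unset
                  or not backlogs[current].online
                  or backlogs[current].head >= entry.last_qtail):
              entry.cpu = desired_cpu
      target = entry.cpu
      entry.last_qtail = backlogs[target].enqueue()
      return target

A flow whose thread migrated from CPU 0 to CPU 1 keeps being steered to
CPU 0 until CPU 0 has dequeued every packet recorded for that flow; only
then does the entry switch to CPU 1.
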

RFS Configuration
-----------------

RFS is only available if the kconfig symbol CONFIG_RPS is enabled (on
by default for SMP). The functionality remains disabled until explicitly
configured. The number of entries in the global flow table is set through::

  /proc/sys/net/core/rps_sock_flow_entries

The number of entries in the per-queue flow table is set through::

  /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

Both of these need to be set before RFS is enabled for a receive queue.
Values for both are rounded up to the nearest power of two. The
suggested flow count depends on the expected number of active connections
at any given time, which may be significantly less than the number of open
connections. We have found that a value of 32768 for rps_sock_flow_entries
works fairly well on a moderately loaded server.

For a single queue device, the rps_flow_cnt value for the single queue
would normally be configured to the same value as rps_sock_flow_entries.
For a multi-queue device, the rps_flow_cnt for each queue might be
configured as rps_sock_flow_entries / N, where N is the number of
queues. So for instance, if rps_sock_flow_entries is set to 32768 and there
are 16 configured receive queues, rps_flow_cnt for each queue might be
configured as 2048.

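The sizing arithmetic above, including the rounding to a power of two,
can be checked with a short helper. The function names are hypothetical
and the helper only reproduces the rule of thumb given here; it does not
query or configure a device.

.. code-block:: python

  def round_up_pow2(n):
      """Round n (>= 1) up to the nearest power of two."""
      return 1 << (n - 1).bit_length()


  def per_queue_flow_cnt(sock_flow_entries, num_rx_queues):
      """Split the global flow table size evenly across receive queues."""
      share = max(1, sock_flow_entries // num_rx_queues)
      return round_up_pow2(share)


  # 32768 global entries over 16 receive queues gives 2048 per queue.
  flow_cnt = per_queue_flow_cnt(32768, 16)

With a queue count that does not divide the table size evenly, such as
12 queues, the per-queue share of 2730 rounds up to 4096.
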

Accelerated RFS
===============

Accelerated RFS is to RFS what RSS is to RPS: a hardware-accelerated load
balancing mechanism that uses soft state to steer flows based on where
the application thread consuming the packets of each flow is running.
Accelerated RFS should perform better than RFS since packets are sent
directly to a CPU local to the thread consuming the data. The target CPU
will either be the same CPU where the application runs, or at least a CPU
which is local to the application thread’s CPU in the cache hierarchy.

To enable accelerated RFS, the networking stack calls the
ndo_rx_flow_steer driver function to communicate the desired hardware
queue for packets matching a particular flow. The network stack
automatically calls this function every time a flow entry in
rps_dev_flow_table is updated. The driver in turn uses a device specific
method to program the NIC to steer the packets.

The hardware queue for a flow is derived from the CPU recorded in
rps_dev_flow_table. The stack consults a CPU to hardware queue map which
is maintained by the NIC driver. This is an auto-generated reverse map of
the IRQ affinity table shown by /proc/interrupts. Drivers can use
functions in the cpu_rmap (“CPU affinity reverse map”) kernel library
to populate the map. For each CPU, the corresponding queue in the map is
set to be one whose processing CPU is closest in cache locality.


Accelerated RFS Configuration
-----------------------------

Accelerated RFS is only available if the kernel is compiled with
CONFIG_RFS_ACCEL and support is provided by the NIC device and driver.
It also requires that ntuple filtering is enabled via ethtool. The map
of CPU to queues is automatically deduced from the IRQ affinities
configured for each receive queue by the driver, so no additional
configuration should be necessary.


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

This technique should be enabled whenever one wants to use RFS and the
NIC supports hardware acceleration.


XPS: Transmit Packet Steering
=============================

Transmit Packet Steering is a mechanism for intelligently selecting
which transmit queue to use when transmitting a packet on a multi-queue
device. This can be accomplished by recording two kinds of maps, either
a mapping of CPU to hardware queue(s) or a mapping of receive queue(s)
to hardware transmit queue(s).

1. XPS using CPUs map

The goal of this mapping is usually to assign queues
exclusively to a subset of CPUs, where the transmit completions for
these queues are processed on a CPU within this set. This choice
provides two benefits. First, contention on the device queue lock is
significantly reduced since fewer CPUs contend for the same queue
(contention can be eliminated completely if each CPU has its own
transmit queue). Secondly, cache miss rate on transmit completion is
reduced, in particular for data cache lines that hold the sk_buff
structures.

2. XPS using receive queues map

This mapping is used to pick the transmit queue based on the receive
queue(s) map configuration set by the administrator. A set of receive
queues can be mapped to a set of transmit queues (many:many), although
the common use case is a 1:1 mapping. This will enable sending packets
on the same queue associations for transmit and receive. This is useful for
busy polling multi-threaded workloads where there are challenges in
associating a given CPU to a given application thread. The application
threads are not pinned to CPUs and each thread handles packets
received on a single queue. The receive queue number is cached in the
socket for the connection. In this model, sending the packets on the same
transmit queue corresponding to the associated receive queue has benefits
in keeping the CPU overhead low. Transmit completion work is locked into
the same queue-association that a given application is polling on. This
avoids the overhead of triggering an interrupt on another CPU. When the
application cleans up the packets during the busy poll, transmit completion
may be processed along with it in the same thread context and so result in
reduced latency.

XPS is configured per transmit queue by setting a bitmap of
CPUs/receive-queues that may use that queue to transmit. The reverse
mapping, from CPUs to transmit queues or from receive-queues to transmit
queues, is computed and maintained for each network device. When
transmitting the first packet in a flow, the function get_xps_queue() is
called to select a queue. This function uses the ID of the receive queue
for the socket connection for a match in the receive queue-to-transmit queue
lookup table. Alternatively, this function can also use the ID of the
running CPU as a key into the CPU-to-queue lookup table. If the
ID matches a single queue, that is used for transmission. If multiple
queues match, one is selected by using the flow hash to compute an index
into the set. When selecting the transmit queue based on receive queue(s)
map, the transmit device is not validated against the receive device as
that would require an expensive lookup operation in the datapath.

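The reverse mapping and the selection step can be sketched as follows.
This is an illustrative model of the description above, not the kernel's
get_xps_queue(): all function names are hypothetical, the maps are plain
dictionaries, and a plain modulo stands in for the hash-based index.

.. code-block:: python

  def build_reverse_map(masks_per_txq):
      """Turn per-transmit-queue bitmaps into a key -> [tx queues] map."""
      reverse = {}
      for txq, mask in enumerate(masks_per_txq):
          for key in range(mask.bit_length()):
              if mask & (1 << key):
                  reverse.setdefault(key, []).append(txq)
      return reverse


  def pick_from_set(queues, flow_hash):
      """Single match is used directly; several are indexed by hash."""
      if not queues:
          return None
      if len(queues) == 1:
          return queues[0]
      return queues[flow_hash % len(queues)]


  def select_tx_queue(flow_hash, cpu, rx_queue, cpu_map, rxq_map):
      """Try the receive-queue map first, then the CPU map."""
      if rx_queue is not None:
          queue = pick_from_set(rxq_map.get(rx_queue, []), flow_hash)
          if queue is not None:
              return queue
      return pick_from_set(cpu_map.get(cpu, []), flow_hash)


  # tx-0 serves CPU 0; tx-1 and tx-2 both serve CPU 1.
  cpu_map = build_reverse_map([0b01, 0b10, 0b10])
  rxq_map = {3: [2]}  # receive queue 3 transmits on tx-2
  chosen = select_tx_queue(flow_hash=5, cpu=1, rx_queue=None,
                           cpu_map=cpu_map, rxq_map=rxq_map)

A return value of ``None`` in this model means that neither map applies,
in which case the caller would have to choose a queue by other means.
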
The queue chosen for transmitting a particular flow is saved in the
corresponding socket structure for the flow (e.g. a TCP connection).
This transmit queue is used for subsequent packets sent on the flow to
prevent out of order (ooo) packets. The choice also amortizes the cost
of calling get_xps_queue() over all packets in the flow. To avoid
ooo packets, the queue for a flow can subsequently only be changed if
skb->ooo_okay is set for a packet in the flow. This flag indicates that
there are no outstanding packets in the flow, so the transmit queue can
change without the risk of generating out of order packets. The
transport layer is responsible for setting ooo_okay appropriately. TCP,
for instance, sets the flag when all data for a connection has been
acknowledged.

XPS Configuration
-----------------

XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
default for SMP). If compiled in, it is driver dependent whether, and
how, XPS is configured at device init. The mapping of CPUs/receive-queues
to transmit queue can be inspected and configured using sysfs:

For selection based on CPUs map::

  /sys/class/net/<dev>/queues/tx-<n>/xps_cpus

For selection based on receive-queues map::

  /sys/class/net/<dev>/queues/tx-<n>/xps_rxqs


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

For a network device with a single transmission queue, XPS configuration
has no effect, since there is no choice in this case. In a multi-queue
system, XPS is preferably configured so that each CPU maps onto one queue.
If there are as many queues as there are CPUs in the system, then each
queue can also map onto one CPU, resulting in exclusive pairings that
experience no contention. If there are fewer queues than CPUs, then the
best CPUs to share a given queue are probably those that share the cache
with the CPU that processes transmit completions for that queue
(transmit interrupts).

For transmit queue selection based on receive queue(s), XPS has to be
explicitly configured with a mapping of receive queue(s) to transmit
queue(s). If the user configuration for the receive-queue map does not
apply, then the transmit queue is selected based on the CPUs map.


Per TX Queue rate limitation
============================

These are rate-limitation mechanisms implemented by hardware. Currently
a max-rate attribute is supported, set by writing a value in Mbps to::

  /sys/class/net/<dev>/queues/tx-<n>/tx_maxrate

A value of zero means disabled, and this is the default.


Further Information
===================

RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated into
2.6.38. Original patches were submitted by Tom Herbert
(therbert@google.com).

Accelerated RFS was introduced in 2.6.35. Original patches were
submitted by Ben Hutchings (bwh@kernel.org).

Authors:

- Tom Herbert (therbert@google.com)
- Willem de Bruijn (willemb@google.com)