============
SNMP counter
============

This document explains the meaning of SNMP counters.

General IPv4 counters
=====================
All layer 4 packets and ICMP packets will change these counters, but
these counters won't be changed by layer 2 packets (such as STP) or
ARP packets.

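As a rough illustration of where names such as IpInReceives come from: on
Linux these values are typically exposed in ``/proc/net/snmp`` and
``/proc/net/netstat`` as pairs of lines, a header line of field names
followed by a line of values, both starting with a group prefix such as
``Ip:`` or ``IpExt:``. The sketch below assumes that layout and joins the
prefix with the field name. The helper name ``parse_snmp_text`` and the
sample numbers are made up for illustration.

```python
def parse_snmp_text(text):
    """Parse header/value line pairs into {"IpInReceives": 10, ...}."""
    counters = {}
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Lines come in pairs: field names first, then the matching values.
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        value_prefix, numbers = values.split(":", 1)
        if prefix != value_prefix:
            raise ValueError("header and value lines do not match")
        for name, number in zip(names.split(), numbers.split()):
            counters[prefix.strip() + name] = int(number)
    return counters


# Hypothetical sample in the assumed layout (not real measurements).
SAMPLE = """\
Ip: InReceives InDelivers OutRequests
Ip: 10 9 7
IpExt: InOctets OutOctets
IpExt: 840 588
"""

counters = parse_snmp_text(SAMPLE)
print(counters["IpInReceives"], counters["IpExtInOctets"])
```

On a real system the same function could be fed the contents of the two
files mentioned above instead of ``SAMPLE``.
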
* IpInReceives

Defined in `RFC1213 ipInReceives`_

.. _RFC1213 ipInReceives: https://tools.ietf.org/html/rfc1213#page-26

The number of packets received by the IP layer. It is increased at the
beginning of the ip_rcv function and is always updated together with
IpExtInOctets. It is increased even if the packet is dropped
later (e.g. because the IP header is invalid or the checksum is
wrong). It indicates the number of aggregated segments after
GRO/LRO.

* IpInDelivers

Defined in `RFC1213 ipInDelivers`_

.. _RFC1213 ipInDelivers: https://tools.ietf.org/html/rfc1213#page-28

The number of packets delivered to the upper layer protocols, e.g. TCP,
UDP, ICMP and so on. If no one listens on a raw socket, only packets of
kernel-supported protocols will be delivered. If someone listens on a
raw socket, all valid IP packets will be delivered.

* IpOutRequests

Defined in `RFC1213 ipOutRequests`_

.. _RFC1213 ipOutRequests: https://tools.ietf.org/html/rfc1213#page-28

The number of packets sent via the IP layer, for both unicast and
multicast packets. It is always updated together with
IpExtOutOctets.

* IpExtInOctets and IpExtOutOctets

They are Linux kernel extensions with no RFC definitions. Please note
that RFC1213 does define ifInOctets and ifOutOctets, but they
are different things. ifInOctets and ifOutOctets include the MAC
layer header size, but IpExtInOctets and IpExtOutOctets don't. They
only include the IP layer header and the IP layer data.

* IpExtInNoECTPkts, IpExtInECT1Pkts, IpExtInECT0Pkts, IpExtInCEPkts

They indicate the number of four kinds of ECN IP packets. Please refer
to `Explicit Congestion Notification`_ for more details.

.. _Explicit Congestion Notification: https://tools.ietf.org/html/rfc3168#page-6

These 4 counters count how many packets were received per ECN
status. They count the real frame number regardless of LRO/GRO. So
for the same packet, you might find that IpInReceives counts 1, but
IpExtInNoECTPkts counts 2 or more.

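The mapping from a packet to one of these four counters can be sketched
as follows. It relies on the ECN codepoints of RFC 3168, which occupy the
two least significant bits of the IPv4 TOS byte. The function names and
the sample TOS values are illustrative, not taken from the kernel.

```python
# ECN codepoints from RFC 3168 (low two bits of the IPv4 TOS byte).
ECN_COUNTER_BY_CODEPOINT = {
    0b00: "IpExtInNoECTPkts",  # Not-ECT
    0b01: "IpExtInECT1Pkts",   # ECT(1)
    0b10: "IpExtInECT0Pkts",   # ECT(0)
    0b11: "IpExtInCEPkts",     # CE (congestion experienced)
}


def ecn_counter_for_tos(tos):
    """Return the counter name that a packet with this TOS byte would hit."""
    return ECN_COUNTER_BY_CODEPOINT[tos & 0b11]


def count_by_ecn(tos_values):
    """Tally a sequence of TOS bytes into the four ECN counters."""
    totals = {name: 0 for name in ECN_COUNTER_BY_CODEPOINT.values()}
    for tos in tos_values:
        totals[ecn_counter_for_tos(tos)] += 1
    return totals


# Hypothetical TOS bytes; 0xB8 carries a DSCP value but is Not-ECT.
totals = count_by_ecn([0x00, 0x02, 0x02, 0x03, 0xB8])
print(totals)
```
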
* IpInHdrErrors

Defined in `RFC1213 ipInHdrErrors`_. It indicates the packet is
dropped due to an IP header error. It might happen in both the IP input
and IP forward paths.

.. _RFC1213 ipInHdrErrors: https://tools.ietf.org/html/rfc1213#page-27

* IpInAddrErrors

Defined in `RFC1213 ipInAddrErrors`_. It will be increased in two
scenarios: (1) The IP address is invalid. (2) The destination IP
address is not a local address and IP forwarding is not enabled.

.. _RFC1213 ipInAddrErrors: https://tools.ietf.org/html/rfc1213#page-27

* IpExtInNoRoutes

This counter means the packet is dropped when the IP stack receives a
packet and can't find a route for it in the route table. It might
happen when IP forwarding is enabled, the destination IP address is
not a local address, and there is no route for the destination IP
address.

* IpInUnknownProtos

Defined in `RFC1213 ipInUnknownProtos`_. It will be increased if the
layer 4 protocol is unsupported by the kernel. If an application is
using a raw socket, the kernel will always deliver the packet to the raw
socket and this counter won't be increased.

.. _RFC1213 ipInUnknownProtos: https://tools.ietf.org/html/rfc1213#page-27

* IpExtInTruncatedPkts

For an IPv4 packet, it means the actual data size is smaller than the
"Total Length" field in the IPv4 header.

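The truncation condition can be sketched in a few lines. The check below
reads the 16-bit big-endian "Total Length" field at byte offset 2 of the
IPv4 header and compares it with the number of bytes actually present.
``is_truncated_ipv4`` and ``make_header`` are hypothetical helpers for
illustration, not kernel functions.

```python
import struct


def is_truncated_ipv4(packet):
    """Return True if fewer bytes are present than Total Length claims."""
    if len(packet) < 20:
        raise ValueError("shorter than a minimal IPv4 header")
    # Total Length is the 16-bit big-endian field at byte offset 2.
    (total_length,) = struct.unpack_from("!H", packet, 2)
    return len(packet) < total_length


def make_header(total_length):
    """Build a 20-byte header with only version/IHL and Total Length set."""
    return struct.pack("!BBH", 0x45, 0, total_length) + bytes(16)


complete = make_header(28) + bytes(8)   # 28 bytes present, 28 claimed
truncated = make_header(28) + bytes(4)  # 24 bytes present, 28 claimed
print(is_truncated_ipv4(complete), is_truncated_ipv4(truncated))
```
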
* IpInDiscards

Defined in `RFC1213 ipInDiscards`_. It indicates the packet is dropped
in the IP receiving path due to kernel internal reasons (e.g. not
enough memory).

.. _RFC1213 ipInDiscards: https://tools.ietf.org/html/rfc1213#page-28

* IpOutDiscards

Defined in `RFC1213 ipOutDiscards`_. It indicates the packet is
dropped in the IP sending path due to kernel internal reasons.

.. _RFC1213 ipOutDiscards: https://tools.ietf.org/html/rfc1213#page-28

* IpOutNoRoutes

Defined in `RFC1213 ipOutNoRoutes`_. It indicates the packet is
dropped in the IP sending path because no route is found for it.

.. _RFC1213 ipOutNoRoutes: https://tools.ietf.org/html/rfc1213#page-29

ICMP counters
=============
* IcmpInMsgs and IcmpOutMsgs

Defined by `RFC1213 icmpInMsgs`_ and `RFC1213 icmpOutMsgs`_

.. _RFC1213 icmpInMsgs: https://tools.ietf.org/html/rfc1213#page-41
.. _RFC1213 icmpOutMsgs: https://tools.ietf.org/html/rfc1213#page-43

As mentioned in RFC1213, these two counters include errors. They
are increased even if the ICMP packet has an invalid type. The
ICMP output path will check the header of a raw socket, so
IcmpOutMsgs would still be updated if the IP header is constructed by
a userspace program.

* ICMP named types

| These counters include most of the common ICMP types. They are:
| IcmpInDestUnreachs: `RFC1213 icmpInDestUnreachs`_
| IcmpInTimeExcds: `RFC1213 icmpInTimeExcds`_
| IcmpInParmProbs: `RFC1213 icmpInParmProbs`_
| IcmpInSrcQuenchs: `RFC1213 icmpInSrcQuenchs`_
| IcmpInRedirects: `RFC1213 icmpInRedirects`_
| IcmpInEchos: `RFC1213 icmpInEchos`_
| IcmpInEchoReps: `RFC1213 icmpInEchoReps`_
| IcmpInTimestamps: `RFC1213 icmpInTimestamps`_
| IcmpInTimestampReps: `RFC1213 icmpInTimestampReps`_
| IcmpInAddrMasks: `RFC1213 icmpInAddrMasks`_
| IcmpInAddrMaskReps: `RFC1213 icmpInAddrMaskReps`_
| IcmpOutDestUnreachs: `RFC1213 icmpOutDestUnreachs`_
| IcmpOutTimeExcds: `RFC1213 icmpOutTimeExcds`_
| IcmpOutParmProbs: `RFC1213 icmpOutParmProbs`_
| IcmpOutSrcQuenchs: `RFC1213 icmpOutSrcQuenchs`_
| IcmpOutRedirects: `RFC1213 icmpOutRedirects`_
| IcmpOutEchos: `RFC1213 icmpOutEchos`_
| IcmpOutEchoReps: `RFC1213 icmpOutEchoReps`_
| IcmpOutTimestamps: `RFC1213 icmpOutTimestamps`_
| IcmpOutTimestampReps: `RFC1213 icmpOutTimestampReps`_
| IcmpOutAddrMasks: `RFC1213 icmpOutAddrMasks`_
| IcmpOutAddrMaskReps: `RFC1213 icmpOutAddrMaskReps`_

.. _RFC1213 icmpInDestUnreachs: https://tools.ietf.org/html/rfc1213#page-41
.. _RFC1213 icmpInTimeExcds: https://tools.ietf.org/html/rfc1213#page-41
.. _RFC1213 icmpInParmProbs: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInSrcQuenchs: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInRedirects: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInEchos: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInEchoReps: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInTimestamps: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInTimestampReps: https://tools.ietf.org/html/rfc1213#page-43
.. _RFC1213 icmpInAddrMasks: https://tools.ietf.org/html/rfc1213#page-43
.. _RFC1213 icmpInAddrMaskReps: https://tools.ietf.org/html/rfc1213#page-43

.. _RFC1213 icmpOutDestUnreachs: https://tools.ietf.org/html/rfc1213#page-44
.. _RFC1213 icmpOutTimeExcds: https://tools.ietf.org/html/rfc1213#page-44
.. _RFC1213 icmpOutParmProbs: https://tools.ietf.org/html/rfc1213#page-44
.. _RFC1213 icmpOutSrcQuenchs: https://tools.ietf.org/html/rfc1213#page-44
.. _RFC1213 icmpOutRedirects: https://tools.ietf.org/html/rfc1213#page-44
.. _RFC1213 icmpOutEchos: https://tools.ietf.org/html/rfc1213#page-45
.. _RFC1213 icmpOutEchoReps: https://tools.ietf.org/html/rfc1213#page-45
.. _RFC1213 icmpOutTimestamps: https://tools.ietf.org/html/rfc1213#page-45
.. _RFC1213 icmpOutTimestampReps: https://tools.ietf.org/html/rfc1213#page-45
.. _RFC1213 icmpOutAddrMasks: https://tools.ietf.org/html/rfc1213#page-45
.. _RFC1213 icmpOutAddrMaskReps: https://tools.ietf.org/html/rfc1213#page-46

Every ICMP type has two counters: 'In' and 'Out'. E.g., for the ICMP
Echo packet, they are IcmpInEchos and IcmpOutEchos. Their meanings are
straightforward. The 'In' counter means the kernel receives such a
packet and the 'Out' counter means the kernel sends such a packet.

* ICMP numeric types

They are IcmpMsgInType[N] and IcmpMsgOutType[N], where [N] indicates the
ICMP type number. These counters track all kinds of ICMP packets. The
ICMP type number definitions can be found in the `ICMP parameters`_
document.

.. _ICMP parameters: https://www.iana.org/assignments/icmp-parameters/icmp-parameters.xhtml

For example, if the Linux kernel sends an ICMP Echo packet,
IcmpMsgOutType8 would increase by 1. And if the kernel gets an ICMP Echo
Reply packet, IcmpMsgInType0 would increase by 1.

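The naming scheme of the numeric counters can be written down as a small
helper. This is only a sketch of the name construction;
``icmp_msg_counter`` is a hypothetical function, and the two constants
are the ICMP type numbers for Echo Reply and Echo.

```python
ICMP_ECHO_REPLY = 0  # ICMP type number of Echo Reply
ICMP_ECHO = 8        # ICMP type number of Echo


def icmp_msg_counter(direction, icmp_type):
    """Build the numeric counter name for a direction and an ICMP type."""
    if direction not in ("In", "Out"):
        raise ValueError("direction must be 'In' or 'Out'")
    if not 0 <= icmp_type <= 255:
        raise ValueError("the ICMP type is an 8-bit value")
    return f"IcmpMsg{direction}Type{icmp_type}"


print(icmp_msg_counter("Out", ICMP_ECHO))
print(icmp_msg_counter("In", ICMP_ECHO_REPLY))
```
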
* IcmpInCsumErrors

This counter indicates the checksum of the ICMP packet is
wrong. The kernel verifies the checksum after updating IcmpInMsgs and
before updating IcmpMsgInType[N]. If a packet has a bad checksum,
IcmpInMsgs would be updated but none of IcmpMsgInType[N] would be updated.

* IcmpInErrors and IcmpOutErrors

Defined by `RFC1213 icmpInErrors`_ and `RFC1213 icmpOutErrors`_

.. _RFC1213 icmpInErrors: https://tools.ietf.org/html/rfc1213#page-41
.. _RFC1213 icmpOutErrors: https://tools.ietf.org/html/rfc1213#page-43

When an error occurs in the ICMP packet handler path, these two
counters would be updated. The receiving packet path uses IcmpInErrors
and the sending packet path uses IcmpOutErrors. When IcmpInCsumErrors
is increased, IcmpInErrors would always be increased too.

Relationship of the ICMP counters
---------------------------------
The sum of IcmpMsgOutType[N] is always equal to IcmpOutMsgs, as they
are updated at the same time. The sum of IcmpMsgInType[N] plus
IcmpInErrors should be equal to or larger than IcmpInMsgs. When the
kernel receives an ICMP packet, it follows the logic below:

1. increase IcmpInMsgs
2. if there is any error, update IcmpInErrors and finish the process
3. update IcmpMsgInType[N]
4. handle the packet depending on the type; if there is any error,
   update IcmpInErrors and finish the process

So if all errors occur in step (2), IcmpInMsgs should be equal to the
sum of IcmpMsgInType[N] plus IcmpInErrors. If all errors occur in
step (4), IcmpInMsgs should be equal to the sum of
IcmpMsgInType[N]. If the errors occur in both step (2) and step (4),
IcmpInMsgs should be less than the sum of IcmpMsgInType[N] plus
IcmpInErrors.

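The four receive steps can be replayed with a toy model to check the
stated relationship. The sketch below uses IcmpMsgInType[N] as the
per-type counter, since the receive path updates the 'In' counters. The
function names and the ``early_error`` / ``handler_error`` flags are
hypothetical stand-ins for errors in step (2) and step (4).

```python
from collections import Counter


def receive_icmp(counters, icmp_type, early_error=False, handler_error=False):
    """Apply the four receive steps to a Counter of SNMP values."""
    counters["IcmpInMsgs"] += 1                  # step 1
    if early_error:                              # step 2, e.g. bad checksum
        counters["IcmpInErrors"] += 1
        return
    counters[f"IcmpMsgInType{icmp_type}"] += 1   # step 3
    if handler_error:                            # step 4
        counters["IcmpInErrors"] += 1


def typed_sum(counters):
    """Sum of all IcmpMsgInType[N] counters."""
    return sum(v for k, v in counters.items() if k.startswith("IcmpMsgInType"))


counters = Counter()
receive_icmp(counters, 8)                      # handled without error
receive_icmp(counters, 8, early_error=True)    # error in step 2
receive_icmp(counters, 0, handler_error=True)  # error in step 4

print(counters["IcmpInMsgs"], typed_sum(counters), counters["IcmpInErrors"])
```

With one error in each of the two steps, the typed sum plus IcmpInErrors
exceeds IcmpInMsgs, which matches the last case described above.
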
General TCP counters
====================
* TcpInSegs

Defined in `RFC1213 tcpInSegs`_

.. _RFC1213 tcpInSegs: https://tools.ietf.org/html/rfc1213#page-48

The number of packets received by the TCP layer. As mentioned in
RFC1213, it includes the packets received in error, such as checksum
error, invalid TCP header and so on. Only one error won't be included:
if the layer 2 destination address is not the NIC's layer 2
address. It might happen if the packet is a multicast or broadcast
packet, or the NIC is in promiscuous mode. In these situations, the
packets would be delivered to the TCP layer, but the TCP layer will discard
these packets before increasing TcpInSegs. The TcpInSegs counter
isn't aware of GRO. So if two packets are merged by GRO, the TcpInSegs
counter would only increase by 1.

* TcpOutSegs

Defined in `RFC1213 tcpOutSegs`_

.. _RFC1213 tcpOutSegs: https://tools.ietf.org/html/rfc1213#page-48

The number of packets sent by the TCP layer. As mentioned in RFC1213,
it excludes the retransmitted packets. But it includes the SYN, ACK
and RST packets. Unlike TcpInSegs, TcpOutSegs is aware of
GSO, so if a packet is split into 2 by GSO, TcpOutSegs will
increase by 2.

* TcpActiveOpens

Defined in `RFC1213 tcpActiveOpens`_

.. _RFC1213 tcpActiveOpens: https://tools.ietf.org/html/rfc1213#page-47

It means the TCP layer sends a SYN and enters the SYN-SENT
state. Every time TcpActiveOpens increases by 1, TcpOutSegs should
always increase by 1.

* TcpPassiveOpens

Defined in `RFC1213 tcpPassiveOpens`_

.. _RFC1213 tcpPassiveOpens: https://tools.ietf.org/html/rfc1213#page-47

It means the TCP layer receives a SYN, replies with a SYN+ACK, and
enters the SYN-RCVD state.

* TcpExtTCPRcvCoalesce

When packets are received by the TCP layer and have not yet been read by
the application, the TCP layer will try to merge them. This counter
indicates how many packets are merged in such a situation. If GRO is
enabled, lots of packets would be merged by GRO. These packets
wouldn't be counted in TcpExtTCPRcvCoalesce.

* TcpExtTCPAutoCorking

When sending packets, the TCP layer will try to merge small packets into
a bigger one. This counter increases by 1 for every packet merged in
such a situation. Please refer to the LWN article for more details:
https://lwn.net/Articles/576263/

* TcpExtTCPOrigDataSent

This counter is explained by `kernel commit f19c29e3e391`_. The
explanation is quoted below::

  TCPOrigDataSent: number of outgoing packets with original data (excluding
  retransmission but including data-in-SYN). This counter is different from
  TcpOutSegs because TcpOutSegs also tracks pure ACKs. TCPOrigDataSent is
  more useful to track the TCP retransmission rate.

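One way to turn this counter into a retransmission rate is to compare two
snapshots taken some time apart. The sketch below assumes that
TcpRetransSegs (the RFC1213 tcpRetransSegs counter) is used as the
numerator; that pairing, the function name and the sample numbers are
assumptions made for illustration.

```python
def retransmission_rate(before, after):
    """Retransmitted segments per original data segment between snapshots.

    Returns None when no original data was sent in the interval.
    """
    sent = after["TcpExtTCPOrigDataSent"] - before["TcpExtTCPOrigDataSent"]
    retrans = after["TcpRetransSegs"] - before["TcpRetransSegs"]
    if sent <= 0:
        return None
    return retrans / sent


# Hypothetical snapshots of the two counters.
before = {"TcpExtTCPOrigDataSent": 1000, "TcpRetransSegs": 10}
after = {"TcpExtTCPOrigDataSent": 1500, "TcpRetransSegs": 15}

rate = retransmission_rate(before, after)
print(rate)
```
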
* TcpExtTCPSynRetrans

This counter is explained by `kernel commit f19c29e3e391`_. The
explanation is quoted below::

  TCPSynRetrans: number of SYN and SYN/ACK retransmits to break down
  retransmissions into SYN, fast-retransmits, timeout retransmits, etc.

* TcpExtTCPFastOpenActiveFail

This counter is explained by `kernel commit f19c29e3e391`_. The
explanation is quoted below::

  TCPFastOpenActiveFail: Fast Open attempts (SYN/data) failed because
  the remote does not accept it or the attempts timed out.

.. _kernel commit f19c29e3e391: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f19c29e3e391a66a273e9afebaf01917245148cd

* TcpExtListenOverflows and TcpExtListenDrops

When the kernel receives a SYN from a client and the TCP accept queue
is full, the kernel will drop the SYN and add 1 to TcpExtListenOverflows.
At the same time the kernel will also add 1 to TcpExtListenDrops. When a
TCP socket is in the LISTEN state and the kernel needs to drop a packet,
the kernel would always add 1 to TcpExtListenDrops. So an increase of
TcpExtListenOverflows always increases TcpExtListenDrops at the
same time, but TcpExtListenDrops can also increase without
TcpExtListenOverflows increasing, e.g. a memory allocation failure would
also increase TcpExtListenDrops.

Note: The above explanation is based on kernel 4.10 or later. On
an older kernel, the TCP stack behaves differently when the TCP accept
queue is full. On the old kernel, the TCP stack won't drop the SYN; it
completes the 3-way handshake. As the accept queue is full, the TCP
stack keeps the socket in the TCP half-open queue. While the socket is
in the half-open queue, the TCP stack sends SYN+ACK on an exponential
backoff timer. After the client replies with an ACK, the TCP stack
checks whether the accept queue is still full. If it is not full, the
socket is moved to the accept queue. If it is full, the socket is kept
in the half-open queue, and the next time the client replies with an
ACK, the socket gets another chance to move to the accept queue.

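The relationship between the two counters on kernel 4.10 or later can be
replayed with a toy model. ``drop_on_listen_socket`` is a hypothetical
function that only mirrors the description above; it is not kernel code.

```python
from collections import Counter


def drop_on_listen_socket(counters, accept_queue_full):
    """Account for one packet dropped on a socket in the LISTEN state."""
    if accept_queue_full:
        counters["TcpExtListenOverflows"] += 1
    # Every drop on a listening socket is counted here, whatever the reason.
    counters["TcpExtListenDrops"] += 1


counters = Counter()
drop_on_listen_socket(counters, accept_queue_full=True)
drop_on_listen_socket(counters, accept_queue_full=False)  # e.g. allocation failure

# Drops that were not caused by a full accept queue.
other_drops = counters["TcpExtListenDrops"] - counters["TcpExtListenOverflows"]
print(counters["TcpExtListenDrops"], counters["TcpExtListenOverflows"], other_drops)
```

Under this model TcpExtListenDrops is never smaller than
TcpExtListenOverflows, and the difference is the number of drops with
other causes.
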
TCP Fast Open
=============
* TcpEstabResets

Defined in `RFC1213 tcpEstabResets`_.

.. _RFC1213 tcpEstabResets: https://tools.ietf.org/html/rfc1213#page-48

* TcpAttemptFails

Defined in `RFC1213 tcpAttemptFails`_.

.. _RFC1213 tcpAttemptFails: https://tools.ietf.org/html/rfc1213#page-48

* TcpOutRsts

Defined in `RFC1213 tcpOutRsts`_. The RFC says this counter indicates
the 'segments sent containing the RST flag', but in the Linux kernel,
this counter indicates the segments the kernel tried to send. The
sending process might fail due to some errors (e.g. a memory allocation
failure).

.. _RFC1213 tcpOutRsts: https://tools.ietf.org/html/rfc1213#page-52

* TcpExtTCPSpuriousRtxHostQueues

When the TCP stack wants to retransmit a packet and finds that the
packet is not lost in the network but has not been sent yet, the TCP
stack would give up the retransmission and update this counter. It
might happen if a packet stays too long in a qdisc or driver
queue.

* TcpEstabResets

The socket receives a RST packet in the Established or CloseWait state.

* TcpExtTCPKeepAlive

This counter indicates how many keepalive packets were sent. Keepalive
is not enabled by default. A userspace program could enable it by
setting the SO_KEEPALIVE socket option.

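Enabling the option from userspace takes a single ``setsockopt`` call. The
sketch below uses Python's standard ``socket`` module on an unconnected
socket, so it only shows the option being set and read back. It does not
send any keepalive packets by itself.

```python
import socket


def make_keepalive_socket():
    """Create a TCP socket with SO_KEEPALIVE switched on."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock


sock = make_keepalive_socket()
# A non-zero value means the option is enabled.
keepalive_enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
print(keepalive_enabled)
```
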
* TcpExtTCPSpuriousRTOs

The spurious retransmission timeout detected by the `F-RTO`_
algorithm.

.. _F-RTO: https://tools.ietf.org/html/rfc5682

TCP Fast Path
=============
When the kernel receives a TCP packet, it has two paths to handle the
packet: one is the fast path, the other is the slow path. The comment in
the kernel code provides a good explanation of them. It is quoted below::

  It is split into a fast path and a slow path. The fast path is
  disabled when:

  - A zero window was announced from us
  - zero window probing
    is only handled properly on the slow path.
  - Out of order segments arrived.
  - Urgent data is expected.
  - There is no buffer space left
  - Unexpected TCP flags/window values/header lengths are received
    (detected by checking the TCP header against pred_flags)
  - Data is sent in both directions. The fast path only supports pure senders
    or pure receivers (this means either the sequence number or the ack
    value must stay constant)
  - Unexpected TCP option.

The kernel will try to use the fast path unless any of the above
conditions is satisfied. If the packets are out of order, the kernel will
handle them in the slow path, which means the performance might not be
very good. The kernel would also enter the slow path if "Delayed ack" is
used, because when using "Delayed ack", the data is sent in both
directions. When the TCP window scale option is not used, the kernel will
try to enable the fast path immediately when the connection enters the
established state, but if the TCP window scale option is used, the kernel
will disable the fast path at first, and try to enable it after the
kernel receives packets.

* TcpExtTCPPureAcks and TcpExtTCPHPAcks

If a packet has the ACK flag set and has no data, it is a pure ACK
packet. If the kernel handles it in the fast path, TcpExtTCPHPAcks will
increase by 1. If the kernel handles it in the slow path,
TcpExtTCPPureAcks will increase by 1.

* TcpExtTCPHPHits

If a TCP packet has data (which means it is not a pure ACK packet),
and this packet is handled in the fast path, TcpExtTCPHPHits will
increase by 1.

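The three fast path counters can be summarized as a small decision
function. This is a sketch of the description above only; the function
name is hypothetical, and it returns None for data packets handled in the
slow path because none of the three counters covers that case.

```python
def fast_path_counter(has_data, fast_path):
    """Return the counter a received packet would hit, or None."""
    if not has_data:
        # Pure ACK: the counter depends on which path handled it.
        return "TcpExtTCPHPAcks" if fast_path else "TcpExtTCPPureAcks"
    # Data packet: only the fast path case has a counter here.
    return "TcpExtTCPHPHits" if fast_path else None


for has_data in (False, True):
    for fast_path in (True, False):
        print(has_data, fast_path, fast_path_counter(has_data, fast_path))
```
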
TCP abort
=========
* TcpExtTCPAbortOnData

It means the TCP layer has data in flight, but needs to close the
connection. So the TCP layer sends a RST to the other side, indicating
the connection is not closed gracefully. An easy way to increase this
counter is using the SO_LINGER option. Please refer to the SO_LINGER
section of the `socket man page`_:

.. _socket man page: http://man7.org/linux/man-pages/man7/socket.7.html

By default, when an application closes a connection, the close function
will return immediately and the kernel will try to send the in-flight
data asynchronously. If you use the SO_LINGER option, set l_onoff to 1,
and l_linger to a positive number, the close function won't return
immediately, but will wait until the in-flight data is acked by the
other side. The maximum wait time is l_linger seconds. If you set
l_onoff to 1 and l_linger to 0, when the application closes a
connection, the kernel will send a RST immediately and increase the
TcpExtTCPAbortOnData counter.

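The SO_LINGER setting with l_onoff set to 1 and l_linger set to 0 can be
applied as sketched below. The option value is a ``struct linger`` made of
two C ints, packed here with the ``struct`` module. The helper names are
illustrative, and the example only sets and reads back the option on an
unconnected socket. It does not close a live connection, so no RST is
sent.

```python
import socket
import struct

LINGER_FORMAT = "ii"  # struct linger { int l_onoff; int l_linger; }


def set_linger(sock, onoff, seconds):
    """Set SO_LINGER with the given l_onoff and l_linger values."""
    value = struct.pack(LINGER_FORMAT, onoff, seconds)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, value)


def get_linger(sock):
    """Read back SO_LINGER as an (l_onoff, l_linger) tuple."""
    size = struct.calcsize(LINGER_FORMAT)
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER, size)
    return struct.unpack(LINGER_FORMAT, raw)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
set_linger(sock, 1, 0)
l_onoff, l_linger = get_linger(sock)
sock.close()
print(l_onoff != 0, l_linger)
```
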
* TcpExtTCPAbortOnClose

This counter means the application has unread data in the TCP layer when
the application wants to close the TCP connection. In such a situation,
the kernel will send a RST to the other side of the TCP connection.

* TcpExtTCPAbortOnMemory

When an application closes a TCP connection, the kernel still needs to
track the connection and let it complete the TCP disconnect process.
E.g. an app calls the close method of a socket, and the kernel sends a
FIN to the other side of the connection. The app then has no
relationship with the socket any more, but the kernel needs to keep the
socket. This socket becomes an orphan socket: the kernel waits for the
reply of the other side, and the socket would finally come to the
TIME_WAIT state. When the kernel doesn't have enough memory to keep the
orphan socket, it would send a RST to the other side and delete the
socket. In such a situation, the kernel will add 1 to
TcpExtTCPAbortOnMemory. Two conditions would trigger
TcpExtTCPAbortOnMemory:

1. the memory used by the TCP protocol is higher than the third value of
   the tcp_mem. Please refer to the tcp_mem section in the `TCP man page`_:

.. _TCP man page: http://man7.org/linux/man-pages/man7/tcp.7.html

2. the orphan socket count is higher than net.ipv4.tcp_max_orphans

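The two conditions can be expressed as a simple check. This is a sketch
of the description above and not the kernel's actual logic, which may
apply further conditions. The function name, the parameter names and the
sample values are hypothetical, and ``tcp_mem`` is taken to be the three
page-count values of the tcp_mem setting.

```python
def orphan_abort_reasons(tcp_pages_used, tcp_mem, orphan_count, tcp_max_orphans):
    """Return which of the two described conditions currently hold."""
    reasons = []
    if tcp_pages_used > tcp_mem[2]:        # condition 1: above third tcp_mem value
        reasons.append("tcp_mem")
    if orphan_count > tcp_max_orphans:     # condition 2: too many orphan sockets
        reasons.append("tcp_max_orphans")
    return reasons


# Hypothetical values: memory is over the limit, the orphan count is not.
reasons = orphan_abort_reasons(
    tcp_pages_used=100,
    tcp_mem=(10, 50, 90),
    orphan_count=5,
    tcp_max_orphans=10,
)
print(reasons)
```
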
* TcpExtTCPAbortOnTimeout

This counter will increase when any of the TCP timers expire. In such a
situation, the kernel won't send a RST and just gives up the connection.

* TcpExtTCPAbortOnLinger

When a TCP connection enters the FIN_WAIT_2 state, instead of waiting
for the FIN packet from the other side, the kernel could send a RST and
delete the socket immediately. This is not the default behavior of the
Linux kernel TCP stack. By configuring the TCP_LINGER2 socket option,
you could let the kernel follow this behavior.

* TcpExtTCPAbortFailed

The kernel TCP layer will send RST if the `RFC2525 2.17 section`_ is
satisfied. If an internal error occurs during this process,
TcpExtTCPAbortFailed will be increased.

.. _RFC2525 2.17 section: https://tools.ietf.org/html/rfc2525#page-50

532*4882a593SmuzhiyunTCP Hybrid Slow Start
533*4882a593Smuzhiyun=====================
534*4882a593SmuzhiyunThe Hybrid Slow Start algorithm is an enhancement of the traditional
535*4882a593SmuzhiyunTCP congestion window Slow Start algorithm. It uses two pieces of
536*4882a593Smuzhiyuninformation to detect whether the max bandwidth of the TCP path is
537*4882a593Smuzhiyunapproached. The two pieces of information are ACK train length and
538*4882a593Smuzhiyunincrease in packet delay. For detail information, please refer the
539*4882a593Smuzhiyun`Hybrid Slow Start paper`_. Either ACK train length or packet delay
540*4882a593Smuzhiyunhits a specific threshold, the congestion control algorithm will come
541*4882a593Smuzhiyuninto the Congestion Avoidance state. Until v4.20, two congestion
542*4882a593Smuzhiyuncontrol algorithms are using Hybrid Slow Start, they are cubic (the
543*4882a593Smuzhiyundefault congestion control algorithm) and cdg. Four snmp counters
544*4882a593Smuzhiyunrelate with the Hybrid Slow Start algorithm.
545*4882a593Smuzhiyun
546*4882a593Smuzhiyun.. _Hybrid Slow Start paper: https://pdfs.semanticscholar.org/25e9/ef3f03315782c7f1cbcd31b587857adae7d1.pdf
547*4882a593Smuzhiyun
548*4882a593Smuzhiyun* TcpExtTCPHystartTrainDetect
549*4882a593Smuzhiyun
550*4882a593SmuzhiyunHow many times the ACK train length threshold is detected
551*4882a593Smuzhiyun
552*4882a593Smuzhiyun* TcpExtTCPHystartTrainCwnd
553*4882a593Smuzhiyun
554*4882a593SmuzhiyunThe sum of CWND detected by ACK train length. Dividing this value by
555*4882a593SmuzhiyunTcpExtTCPHystartTrainDetect is the average CWND which detected by the
556*4882a593SmuzhiyunACK train length.
557*4882a593Smuzhiyun
558*4882a593Smuzhiyun* TcpExtTCPHystartDelayDetect
559*4882a593Smuzhiyun
560*4882a593SmuzhiyunHow many times the packet delay threshold is detected.
561*4882a593Smuzhiyun
562*4882a593Smuzhiyun* TcpExtTCPHystartDelayCwnd
563*4882a593Smuzhiyun
564*4882a593SmuzhiyunThe sum of CWND detected by packet delay. Dividing this value by
565*4882a593SmuzhiyunTcpExtTCPHystartDelayDetect is the average CWND which detected by the
566*4882a593Smuzhiyunpacket delay.
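The two divisions described above can be sketched as follows. The
helper name and the dictionary input are assumptions made for this
example; the dictionary is expected to map counter names to absolute
values collected by the caller.

```python
def hystart_average_cwnd(counters):
    """Return the average CWND per detection method, or None if never detected."""
    averages = {}
    for kind in ("Train", "Delay"):
        detect = counters.get(f"TcpExtTCPHystart{kind}Detect", 0)
        cwnd_sum = counters.get(f"TcpExtTCPHystart{kind}Cwnd", 0)
        # Avoid dividing by zero when the threshold was never hit.
        averages[kind] = cwnd_sum / detect if detect else None
    return averages
```

A result of None for a method means its Detect counter is zero, so no
average can be computed.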
567*4882a593Smuzhiyun
TCP retransmission and congestion control
=========================================
The TCP protocol has two retransmission mechanisms: SACK and fast
recovery. They are mutually exclusive. When SACK is enabled, the
kernel TCP stack uses SACK; otherwise it uses fast recovery. SACK is a
TCP option, which is defined in `RFC2018`_. Fast recovery is defined
in `RFC6582`_ and is also called 'Reno'.

TCP congestion control is a big and complex topic. To understand
the related SNMP counters, we need to know the states of the congestion
control state machine. There are 5 states: Open, Disorder, CWR,
Recovery and Loss. For details about these states, please refer to
pages 5 and 6 of this document:
https://pdfs.semanticscholar.org/0e9c/968d09ab2e53e24c4dca5b2d67c7f7140f8e.pdf

.. _RFC2018: https://tools.ietf.org/html/rfc2018
.. _RFC6582: https://tools.ietf.org/html/rfc6582

* TcpExtTCPRenoRecovery and TcpExtTCPSackRecovery

When the congestion control enters the Recovery state,
TcpExtTCPSackRecovery increases by 1 if SACK is used, and
TcpExtTCPRenoRecovery increases by 1 if SACK is not used. These two
counters mean the TCP stack begins to retransmit the lost packets.

* TcpExtTCPSACKReneging

A packet was acknowledged by SACK, but the receiver has dropped this
packet, so the sender needs to retransmit this packet. In this
situation, the sender adds 1 to TcpExtTCPSACKReneging. A receiver
can drop a packet which has been acknowledged by SACK; although this
is unusual, it is allowed by the TCP protocol. The sender doesn't
really know what happened on the receiver side. The sender just waits
until the RTO expires for this packet, then assumes this packet has
been dropped by the receiver.

* TcpExtTCPRenoReorder

A reordered packet is detected by fast recovery. This counter is only
used if SACK is disabled. The fast recovery algorithm detects
reordering by the number of duplicate ACKs. E.g., if a retransmission
is triggered, and the original packet is not lost but just out of
order, the receiver acknowledges multiple times: once for the
retransmitted packet, and once for the arrival of the original out of
order packet. Thus the sender finds more ACKs than it expects, and
knows that reordering has occurred.

* TcpExtTCPTSReorder

A reordered packet is detected when a hole is filled. E.g., assume the
sender sends packets 1,2,3,4,5, and the receiving order is
1,2,4,5,3. When the sender receives the ACK of packet 3 (which fills
the hole), either of two conditions lets TcpExtTCPTSReorder increase
by 1: (1) packet 3 has not been retransmitted yet; (2) packet 3 has
been retransmitted, but the timestamp of packet 3's ACK is earlier
than the retransmission timestamp.
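The two TcpExtTCPTSReorder conditions can be written as a small
predicate. This is a sketch of the description above only; the function
and argument names are hypothetical, timestamps are plain integers, and
timestamp wraparound is deliberately ignored.

```python
def is_ts_reorder(retransmitted, ack_timestamp=None, retrans_timestamp=None):
    """Decide whether the ACK that fills the hole indicates reordering."""
    # Condition (1): the packet was never retransmitted, so the ACK can
    # only belong to the original, late-arriving packet.
    if not retransmitted:
        return True
    # Condition (2): the ACK's timestamp predates the retransmission, so
    # it acknowledges the original transmission, not the retransmission.
    return ack_timestamp < retrans_timestamp
```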
625*4882a593Smuzhiyun
* TcpExtTCPSACKReorder

A reordered packet is detected by SACK. SACK has two methods to
detect reordering: (1) A DSACK is received by the sender. It means the
sender sent the same packet more than once, and the only reason is
that the sender believed an out of order packet was lost, so it sent
the packet again. (2) Assume packets 1,2,3,4,5 are sent by the sender,
and the sender has received SACKs for packets 2 and 5. If the sender
now receives a SACK for packet 4 and has not retransmitted the packet
yet, the sender knows packet 4 is out of order. The kernel TCP stack
increases TcpExtTCPSACKReorder in both of the above scenarios.

* TcpExtTCPSlowStartRetrans

The TCP stack wants to retransmit a packet and the congestion control
state is 'Loss'.

* TcpExtTCPFastRetrans

The TCP stack wants to retransmit a packet and the congestion control
state is not 'Loss'.

* TcpExtTCPLostRetransmit

A SACK points out that a retransmitted packet is lost again.

* TcpExtTCPRetransFail

The TCP stack tries to deliver a retransmitted packet to the lower
layers, but the lower layers return an error.

* TcpExtTCPSynRetrans

The TCP stack retransmits a SYN packet.

DSACK
=====
DSACK is defined in `RFC2883`_. The receiver uses DSACK to report
duplicate packets to the sender. There are two kinds of
duplication: (1) a packet which has already been acknowledged is
duplicated; (2) an out of order packet is duplicated. The TCP stack
counts these two kinds of duplication on both the receiver side and
the sender side.

.. _RFC2883: https://tools.ietf.org/html/rfc2883

* TcpExtTCPDSACKOldSent

The TCP stack receives a duplicate packet which has been ACKed, so it
sends a DSACK to the sender.

* TcpExtTCPDSACKOfoSent

The TCP stack receives an out of order duplicate packet, so it sends a
DSACK to the sender.

* TcpExtTCPDSACKRecv

The TCP stack receives a DSACK, which indicates that an acknowledged
duplicate packet was received.

* TcpExtTCPDSACKOfoRecv

The TCP stack receives a DSACK, which indicates that an out of order
duplicate packet was received.

invalid SACK and DSACK
======================
When a SACK (or DSACK) block is invalid, a corresponding counter is
updated. The validation method is based on the start/end sequence
numbers of the SACK block. For more details, please refer to the
comment of the function tcp_is_sackblock_valid in the kernel source
code. A SACK option can have up to 4 blocks, and they are checked
individually. E.g., if 3 blocks of a SACK option are invalid, the
corresponding counter is updated 3 times. The comment of the
`Add counters for discarded SACK blocks`_ patch has additional
explanation:

.. _Add counters for discarded SACK blocks: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=18f02545a9a16c9a89778b91a162ad16d510bb32

* TcpExtTCPSACKDiscard

This counter indicates how many SACK blocks are invalid. If the invalid
SACK block is caused by ACK reordering, the TCP stack just ignores
it and does not update this counter.

* TcpExtTCPDSACKIgnoredOld and TcpExtTCPDSACKIgnoredNoUndo

When a DSACK block is invalid, one of these two counters is
updated. Which counter is updated depends on the undo_marker flag
of the TCP socket. If the undo_marker is not set, the TCP stack is
unlikely to have retransmitted any packets; if it still receives an
invalid DSACK block, the reason might be that the packet was duplicated
in the middle of the network. In this scenario,
TcpExtTCPDSACKIgnoredNoUndo is updated. If the undo_marker is set,
TcpExtTCPDSACKIgnoredOld is updated. As implied by its name, it might
be an old packet.

SACK shift
==========
The Linux networking stack stores data in the sk_buff struct (skb for
short). If a SACK block spans multiple skbs, the TCP stack tries
to rearrange the data in these skbs. E.g., if a SACK block acknowledges
seq 10 to 15, skb1 has seq 10 to 13 and skb2 has seq 14 to 20, then
seq 14 and 15 in skb2 are moved to skb1. This operation is 'shift'. If
a SACK block acknowledges seq 10 to 20, skb1 has seq 10 to 13 and skb2
has seq 14 to 20, then all data in skb2 is moved to skb1 and skb2 is
discarded. This operation is 'merge'.
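The two examples above can be replayed with a toy model in which an skb
is just an inclusive (first_seq, last_seq) pair. This is a sketch for
illustration, not the kernel's implementation: the function name is
hypothetical, and it only handles a SACK block that covers all of skb1
and reaches into the directly following skb2.

```python
def sack_shift(skb1, skb2, sack):
    """Return (new_skb1, new_skb2, operation) for two adjacent skbs."""
    sack_start, sack_end = sack
    # Outside the modelled case: the block does not cover skb1 from its
    # first byte, or it does not reach into skb2 at all.
    if sack_start > skb1[0] or sack_end < skb2[0]:
        return skb1, skb2, "none"
    # The block covers all of skb2: move everything and drop skb2.
    if sack_end >= skb2[1]:
        return (skb1[0], skb2[1]), None, "merge"
    # The block covers only the head of skb2: move that part to skb1.
    return (skb1[0], sack_end), (sack_end + 1, skb2[1]), "shift"
```

With skb1 = (10, 13) and skb2 = (14, 20), a SACK block of (10, 15)
yields a 'shift' and a SACK block of (10, 20) yields a 'merge', as in
the text.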
734*4882a593Smuzhiyun
* TcpExtTCPSackShifted

An skb is shifted.

* TcpExtTCPSackMerged

An skb is merged.

* TcpExtTCPSackShiftFallback

An skb should be shifted or merged, but the TCP stack doesn't do it for
some reason.

TCP out of order
================
* TcpExtTCPOFOQueue

The TCP layer receives an out of order packet and has enough memory
to queue it.

* TcpExtTCPOFODrop

The TCP layer receives an out of order packet but doesn't have enough
memory, so it drops the packet. Such packets are not counted in
TcpExtTCPOFOQueue.

* TcpExtTCPOFOMerge

The received out of order packet overlaps with the previous
packet. The overlapping part is dropped. All TcpExtTCPOFOMerge
packets are also counted in TcpExtTCPOFOQueue.

TCP PAWS
========
PAWS (Protection Against Wrapped Sequence numbers) is an algorithm
which is used to drop old packets. It depends on the TCP
timestamps. For detailed information, please refer to the
`timestamp wiki`_ and the `RFC of PAWS`_.

.. _RFC of PAWS: https://tools.ietf.org/html/rfc1323#page-17
.. _timestamp wiki: https://en.wikipedia.org/wiki/Transmission_Control_Protocol#TCP_timestamps

* TcpExtPAWSActive

Packets are dropped by PAWS in the Syn-Sent state.

* TcpExtPAWSEstab

Packets are dropped by PAWS in any state other than Syn-Sent.
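The core PAWS comparison from the RFC can be sketched as below: a
segment is treated as old when its timestamp value is before the most
recently accepted one, compared in 32-bit wraparound arithmetic. The
function and argument names are hypothetical, and the sketch omits the
extra conditions a real stack applies (for example, handling of
connections that have been idle for a long time).

```python
def paws_reject(ts_recent, seg_tsval):
    """Return True if seg_tsval is older than ts_recent (mod 2**32)."""
    delta = (seg_tsval - ts_recent) & 0xFFFFFFFF
    # A delta in the upper half of the 32-bit range corresponds to a
    # negative signed difference, i.e. the segment timestamp is older.
    return delta >= 0x80000000
```

Because the comparison is modular, a small timestamp that follows a
value close to 2**32 is still treated as newer, not older.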
784*4882a593Smuzhiyun
TCP ACK skip
============
In some scenarios, the kernel avoids sending duplicate ACKs too
frequently. Please find more details in the tcp_invalid_ratelimit
section of the `sysctl document`_. When the kernel decides to skip an
ACK due to tcp_invalid_ratelimit, it updates one of the counters below
to indicate the scenario in which the ACK is skipped. The ACK is only
skipped if the received packet is either a SYN packet or has no data.

.. _sysctl document: https://www.kernel.org/doc/Documentation/networking/ip-sysctl.rst

* TcpExtTCPACKSkippedSynRecv

The ACK is skipped in the Syn-Recv state. The Syn-Recv state means the
TCP stack has received a SYN and replied with a SYN+ACK, and is now
waiting for an ACK. Generally, the TCP stack doesn't need to send an
ACK in the Syn-Recv state. But in several scenarios, the TCP stack
needs to send an ACK. E.g., the TCP stack receives the same SYN packet
repeatedly, the received packet does not pass the PAWS check, or the
sequence number of the received packet is out of window. If the ACK
sending frequency is higher than tcp_invalid_ratelimit allows, the TCP
stack skips sending the ACK and increases TcpExtTCPACKSkippedSynRecv.


* TcpExtTCPACKSkippedPAWS

The ACK is skipped because the PAWS (Protection Against Wrapped
Sequence numbers) check fails. If the PAWS check fails in the Syn-Recv,
Fin-Wait-2 or Time-Wait state, the skipped ACK is counted in
TcpExtTCPACKSkippedSynRecv, TcpExtTCPACKSkippedFinWait2 or
TcpExtTCPACKSkippedTimeWait respectively. In all other states, the
skipped ACK is counted in TcpExtTCPACKSkippedPAWS.

* TcpExtTCPACKSkippedSeq

The sequence number is out of window, the timestamp passes the PAWS
check, and the TCP state is not Syn-Recv, Fin-Wait-2 or Time-Wait.

* TcpExtTCPACKSkippedFinWait2

The ACK is skipped in the Fin-Wait-2 state. The reason is either that
the PAWS check fails or that the received sequence number is out of
window.

* TcpExtTCPACKSkippedTimeWait

The ACK is skipped in the Time-Wait state. The reason is either that
the PAWS check fails or that the received sequence number is out of
window.

* TcpExtTCPACKSkippedChallenge

The ACK is skipped if the ACK is a challenge ACK. RFC 5961 defines
3 kinds of challenge ACK; please refer to `RFC 5961 section 3.2`_,
`RFC 5961 section 4.2`_ and `RFC 5961 section 5.2`_. Besides these
three scenarios, in some TCP states, the Linux TCP stack also sends
challenge ACKs if the ACK number is before the first unacknowledged
number (more strict than `RFC 5961 section 5.2`_).

.. _RFC 5961 section 3.2: https://tools.ietf.org/html/rfc5961#page-7
.. _RFC 5961 section 4.2: https://tools.ietf.org/html/rfc5961#page-9
.. _RFC 5961 section 5.2: https://tools.ietf.org/html/rfc5961#page-11
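The rules in this section can be summarized as a lookup from TCP state
and skip reason to counter name. This is one reading of the text above,
not a transcription of kernel code. The function name, the state
strings and the reason strings ('paws', 'seq', 'challenge') are all
assumptions made for the example.

```python
def ack_skipped_counter(state, reason):
    """Map a rate-limited ACK to the counter this section describes."""
    # Challenge ACKs have their own counter.
    if reason == "challenge":
        return "TcpExtTCPACKSkippedChallenge"
    # In these three states the per-state counter is used, whether the
    # reason is a failed PAWS check or an out of window sequence number.
    per_state = {
        "SYN_RECV": "TcpExtTCPACKSkippedSynRecv",
        "FIN_WAIT2": "TcpExtTCPACKSkippedFinWait2",
        "TIME_WAIT": "TcpExtTCPACKSkippedTimeWait",
    }
    if state in per_state:
        return per_state[state]
    # In all other states the counter depends on the reason.
    per_reason = {
        "paws": "TcpExtTCPACKSkippedPAWS",
        "seq": "TcpExtTCPACKSkippedSeq",
    }
    return per_reason[reason]
```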
847*4882a593Smuzhiyun
TCP receive window
==================
* TcpExtTCPWantZeroWindowAdv

Depending on current memory usage, the TCP stack tries to set the
receive window to zero. But the receive window might still be a
nonzero value. For example, if the previous window size is 10, and the
TCP stack receives 3 bytes, the current window size is 7 even if the
window size calculated from the memory usage is zero.

* TcpExtTCPToZeroWindowAdv

The TCP receive window is set to zero from a nonzero value.

* TcpExtTCPFromZeroWindowAdv

The TCP receive window is set to a nonzero value from zero.
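The window arithmetic in the TcpExtTCPWantZeroWindowAdv example can be
sketched as follows. The function and parameter names are hypothetical,
and window scaling and alignment are left out; the only rule modelled
is that space which has already been advertised is not taken back.

```python
def advertised_window(previous_window, bytes_received, wanted_window):
    """Return the window to advertise, never shrinking what was promised."""
    # What is left of the previously advertised window after new data.
    still_promised = previous_window - bytes_received
    # The stack may want less (even zero), but cannot go below that.
    return max(wanted_window, still_promised)
```

With a previous window of 10, 3 bytes received and a wanted window of
0, the result is 7, matching the example above.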
865*4882a593Smuzhiyun
866*4882a593Smuzhiyun
Delayed ACK
===========
TCP Delayed ACK is a technique which is used for reducing the
packet count in the network. For more details, please refer to the
`Delayed ACK wiki`_.

.. _Delayed ACK wiki: https://en.wikipedia.org/wiki/TCP_delayed_acknowledgment

* TcpExtDelayedACKs

A delayed ACK timer expires. The TCP stack sends a pure ACK packet
and exits the delayed ACK mode.

* TcpExtDelayedACKLocked

A delayed ACK timer expires, but the TCP stack can't send an ACK
immediately because the socket is locked by a userspace program. The
TCP stack sends a pure ACK later (after the userspace program
unlocks the socket). When the TCP stack sends the pure ACK later, it
also updates TcpExtDelayedACKs and exits the delayed ACK mode.

* TcpExtDelayedACKLost

This counter is updated when the TCP stack receives a packet which has
already been ACKed. A delayed ACK loss might cause this, but it can
also be triggered by other causes, such as a packet being duplicated
in the network.

Tail Loss Probe (TLP)
=====================
TLP is an algorithm which is used to detect TCP packet loss. For more
details, please refer to the `TLP paper`_.

.. _TLP paper: https://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01

* TcpExtTCPLossProbes

A TLP probe packet is sent.

* TcpExtTCPLossProbeRecovery

A packet loss is detected and recovered by TLP.

TCP Fast Open description
=========================
TCP Fast Open is a technology which allows data transfer before the
3-way handshake completes. Please refer to the `TCP Fast Open wiki`_
for a general description.

.. _TCP Fast Open wiki: https://en.wikipedia.org/wiki/TCP_Fast_Open

* TcpExtTCPFastOpenActive

When the TCP stack receives an ACK packet in the SYN-SENT state, and
the ACK packet acknowledges the data in the SYN packet, the TCP stack
understands that the TFO cookie is accepted by the other side, and it
updates this counter.

* TcpExtTCPFastOpenActiveFail

This counter indicates that the TCP stack initiated a TCP Fast Open,
but it failed. This counter is updated in three scenarios: (1)
the other side doesn't acknowledge the data in the SYN packet; (2) the
SYN packet which has the TFO cookie times out at least once; (3)
after the 3-way handshake, the retransmission timeout happens
net.ipv4.tcp_retries1 times, because some middleboxes may black-hole
fast open after the handshake.

* TcpExtTCPFastOpenPassive

This counter indicates how many times the TCP stack accepts the fast
open request.

* TcpExtTCPFastOpenPassiveFail

This counter indicates how many times the TCP stack rejects the fast
open request. The cause is either that the TFO cookie is invalid or
that the TCP stack finds an error during the socket creation process.

* TcpExtTCPFastOpenListenOverflow

When the number of pending fast open requests is larger than
fastopenq->max_qlen, the TCP stack rejects the fast open request
and updates this counter. When this counter is updated, the TCP stack
won't update TcpExtTCPFastOpenPassive or
TcpExtTCPFastOpenPassiveFail. The fastopenq->max_qlen is set by the
TCP_FASTOPEN socket option and it cannot be larger than
net.core.somaxconn. For example::

  setsockopt(sfd, SOL_TCP, TCP_FASTOPEN, &qlen, sizeof(qlen));

* TcpExtTCPFastOpenCookieReqd

This counter indicates how many times a client wants to request a TFO
cookie.
963*4882a593Smuzhiyun
SYN cookies
===========
SYN cookies are used to mitigate SYN floods. For details, please refer
to the `SYN cookies wiki`_.

.. _SYN cookies wiki: https://en.wikipedia.org/wiki/SYN_cookies

* TcpExtSyncookiesSent

It indicates how many SYN cookies are sent.

* TcpExtSyncookiesRecv

How many reply packets of the SYN cookies the TCP stack receives.

* TcpExtSyncookiesFailed

The MSS decoded from the SYN cookie is invalid. When this counter is
updated, the received packet won't be treated as a SYN cookie and the
TcpExtSyncookiesRecv counter won't be updated.

Challenge ACK
=============
For details of challenge ACK, please refer to the explanation of
TcpExtTCPACKSkippedChallenge.

* TcpExtTCPChallengeACK

The number of challenge ACKs sent.

* TcpExtTCPSYNChallenge

The number of challenge ACKs sent in response to SYN packets. After
updating this counter, the TCP stack might send a challenge ACK and
update the TcpExtTCPChallengeACK counter, or it might skip sending the
challenge ACK and update the TcpExtTCPACKSkippedChallenge counter.

prune
=====
When a socket is under memory pressure, the TCP stack tries to
reclaim memory from the receiving queue and the out of order queue.
One of the reclaiming methods is 'collapse', which means allocating a
big skb, copying the contiguous skbs to the single big skb, and
freeing these contiguous skbs.

* TcpExtPruneCalled

The TCP stack tries to reclaim memory for a socket. After updating
this counter, the TCP stack tries to collapse the out of order queue
and the receiving queue. If the memory is still not enough, the TCP
stack tries to discard packets from the out of order queue (and
updates the TcpExtOfoPruned counter).

* TcpExtOfoPruned

The TCP stack tries to discard packets from the out of order queue.

* TcpExtRcvPruned

After 'collapse' and discarding packets from the out of order queue, if
the memory actually used is still larger than the maximum allowed
memory, this counter is updated. It means the 'prune' fails.

* TcpExtTCPRcvCollapsed

This counter indicates how many skbs are freed during 'collapse'.
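The order in which the prune counters are updated can be sketched as
below. This is a toy model of the description above: the function name
and all parameters are hypothetical, memory is a plain number, and the
amounts reclaimed by each step are passed in by the caller instead of
being computed. TcpExtTCPRcvCollapsed is not modelled.

```python
def prune(used, limit, collapse_gain, ofo_gain):
    """Return the counters that would be updated, in order."""
    counters = ["TcpExtPruneCalled"]
    # Step 1: collapse the out of order queue and the receiving queue.
    used -= collapse_gain
    if used > limit:
        # Step 2: discard packets from the out of order queue.
        counters.append("TcpExtOfoPruned")
        used -= ofo_gain
    if used > limit:
        # Still above the limit after both steps: the prune failed.
        counters.append("TcpExtRcvPruned")
    return counters
```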
1030*4882a593Smuzhiyun
examples
========

ping test
---------
Run the ping command against the public DNS server 8.8.8.8::

  nstatuser@nstat-a:~$ ping 8.8.8.8 -c 1
  PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
  64 bytes from 8.8.8.8: icmp_seq=1 ttl=119 time=17.8 ms

  --- 8.8.8.8 ping statistics ---
  1 packets transmitted, 1 received, 0% packet loss, time 0ms
  rtt min/avg/max/mdev = 17.875/17.875/17.875/0.000 ms

The nstat result::

  nstatuser@nstat-a:~$ nstat
  #kernel
  IpInReceives                    1                  0.0
  IpInDelivers                    1                  0.0
  IpOutRequests                   1                  0.0
  IcmpInMsgs                      1                  0.0
  IcmpInEchoReps                  1                  0.0
  IcmpOutMsgs                     1                  0.0
  IcmpOutEchos                    1                  0.0
  IcmpMsgInType0                  1                  0.0
  IcmpMsgOutType8                 1                  0.0
  IpExtInOctets                   84                 0.0
  IpExtOutOctets                  84                 0.0
  IpExtInNoECTPkts                1                  0.0

The Linux server sent an ICMP Echo packet, so IpOutRequests,
IcmpOutMsgs, IcmpOutEchos and IcmpMsgOutType8 were increased by 1. The
server got an ICMP Echo Reply from 8.8.8.8, so IpInReceives, IcmpInMsgs,
IcmpInEchoReps and IcmpMsgInType0 were increased by 1. The ICMP Echo
Reply was passed to the ICMP layer via the IP layer, so IpInDelivers
was increased by 1. The default ping data size is 56 bytes (see the
"56(84) bytes of data" line in the ping output), so an ICMP Echo packet
and its corresponding Echo Reply packet consist of:

* 14 bytes MAC header
* 20 bytes IP header
* 8 bytes ICMP header
* 56 bytes data (default value of the ping command)

So the IpExtInOctets and IpExtOutOctets are 20+8+56=84. The MAC header
is not included in these counters.
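The arithmetic can be checked against the nstat output with a short
script. This is a sketch only: parse_nstat is a hypothetical helper,
and it assumes the three-column layout (name, value, rate) shown in the
output above. The ICMP header and the ping payload are counted together
as one 64-byte ICMP message.

```python
NSTAT_OUTPUT = """\
#kernel
IpInReceives                    1                  0.0
IpExtInOctets                   84                 0.0
IpExtOutOctets                  84                 0.0
"""

def parse_nstat(text):
    """Parse nstat output lines into a {counter_name: value} dict."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the '#kernel' header and anything that is not a counter row.
        if line.startswith("#") or len(fields) != 3:
            continue
        name, value, _rate = fields
        counters[name] = int(value)
    return counters

IP_HEADER = 20      # bytes, IPv4 header without options
ICMP_MESSAGE = 64   # bytes, ICMP header plus ping payload

counters = parse_nstat(NSTAT_OUTPUT)
expected_octets = IP_HEADER + ICMP_MESSAGE
```

Here expected_octets is 84, the same value reported for both
IpExtInOctets and IpExtOutOctets.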
1077*4882a593Smuzhiyun
1078*4882a593Smuzhiyuntcp 3-way handshake
1079*4882a593Smuzhiyun-------------------
1080*4882a593SmuzhiyunOn server side, we run::
1081*4882a593Smuzhiyun
1082*4882a593Smuzhiyun  nstatuser@nstat-b:~$ nc -lknv 0.0.0.0 9000
1083*4882a593Smuzhiyun  Listening on [0.0.0.0] (family 0, port 9000)
1084*4882a593Smuzhiyun
1085*4882a593SmuzhiyunOn client side, we run::
1086*4882a593Smuzhiyun
1087*4882a593Smuzhiyun  nstatuser@nstat-a:~$ nc -nv 192.168.122.251 9000
1088*4882a593Smuzhiyun  Connection to 192.168.122.251 9000 port [tcp/*] succeeded!
1089*4882a593Smuzhiyun
The server listened on TCP port 9000 and the client connected to it, so
the two sides completed the 3-way handshake.
1092*4882a593Smuzhiyun
1093*4882a593SmuzhiyunOn server side, we can find below nstat output::
1094*4882a593Smuzhiyun
1095*4882a593Smuzhiyun  nstatuser@nstat-b:~$ nstat | grep -i tcp
1096*4882a593Smuzhiyun  TcpPassiveOpens                 1                  0.0
1097*4882a593Smuzhiyun  TcpInSegs                       2                  0.0
1098*4882a593Smuzhiyun  TcpOutSegs                      1                  0.0
1099*4882a593Smuzhiyun  TcpExtTCPPureAcks               1                  0.0
1100*4882a593Smuzhiyun
1101*4882a593SmuzhiyunOn client side, we can find below nstat output::
1102*4882a593Smuzhiyun
1103*4882a593Smuzhiyun  nstatuser@nstat-a:~$ nstat | grep -i tcp
1104*4882a593Smuzhiyun  TcpActiveOpens                  1                  0.0
1105*4882a593Smuzhiyun  TcpInSegs                       1                  0.0
1106*4882a593Smuzhiyun  TcpOutSegs                      2                  0.0
1107*4882a593Smuzhiyun
When the server received the first SYN, it replied with a SYN+ACK and
entered the SYN-RCVD state, so TcpPassiveOpens increased by 1. The server
received the SYN, sent the SYN+ACK, and received the ACK, so it sent 1
packet and received 2 packets: TcpInSegs increased by 2 and TcpOutSegs
increased by 1. The last ACK of the 3-way handshake is a pure ACK without
data, so TcpExtTCPPureAcks increased by 1.

When the client sent the SYN, it entered the SYN-SENT state, so
TcpActiveOpens increased by 1. The client sent the SYN, received the
SYN+ACK, and sent the ACK, so it sent 2 packets and received 1 packet:
TcpInSegs increased by 1 and TcpOutSegs increased by 2.
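
The counter changes on both sides can be replayed with a small
bookkeeping model. This is only a sketch of the accounting described
above, not kernel code; the function name ``replay_handshake`` is
hypothetical.

```python
from collections import Counter


def replay_handshake():
    """Replay SYN, SYN+ACK, ACK and tally the counters on both sides."""
    client, server = Counter(), Counter()

    # Step 1: the client sends SYN and enters SYN-SENT.
    client["TcpActiveOpens"] += 1
    client["TcpOutSegs"] += 1
    server["TcpInSegs"] += 1

    # Step 2: the server replies with SYN+ACK and enters SYN-RCVD.
    server["TcpPassiveOpens"] += 1
    server["TcpOutSegs"] += 1
    client["TcpInSegs"] += 1

    # Step 3: the client sends the final ACK, a pure ACK without data.
    client["TcpOutSegs"] += 1
    server["TcpInSegs"] += 1
    server["TcpExtTCPPureAcks"] += 1

    return dict(client), dict(server)


client, server = replay_handshake()
print("client:", client)
print("server:", server)
```

The resulting totals match the two nstat outputs shown for this test.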
1119*4882a593Smuzhiyun
1120*4882a593SmuzhiyunTCP normal traffic
1121*4882a593Smuzhiyun------------------
1122*4882a593SmuzhiyunRun nc on server::
1123*4882a593Smuzhiyun
1124*4882a593Smuzhiyun  nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
1125*4882a593Smuzhiyun  Listening on [0.0.0.0] (family 0, port 9000)
1126*4882a593Smuzhiyun
1127*4882a593SmuzhiyunRun nc on client::
1128*4882a593Smuzhiyun
1129*4882a593Smuzhiyun  nstatuser@nstat-a:~$ nc -v nstat-b 9000
1130*4882a593Smuzhiyun  Connection to nstat-b 9000 port [tcp/*] succeeded!
1131*4882a593Smuzhiyun
1132*4882a593SmuzhiyunInput a string in the nc client ('hello' in our example)::
1133*4882a593Smuzhiyun
1134*4882a593Smuzhiyun  nstatuser@nstat-a:~$ nc -v nstat-b 9000
1135*4882a593Smuzhiyun  Connection to nstat-b 9000 port [tcp/*] succeeded!
1136*4882a593Smuzhiyun  hello
1137*4882a593Smuzhiyun
1138*4882a593SmuzhiyunThe client side nstat output::
1139*4882a593Smuzhiyun
1140*4882a593Smuzhiyun  nstatuser@nstat-a:~$ nstat
1141*4882a593Smuzhiyun  #kernel
1142*4882a593Smuzhiyun  IpInReceives                    1                  0.0
1143*4882a593Smuzhiyun  IpInDelivers                    1                  0.0
1144*4882a593Smuzhiyun  IpOutRequests                   1                  0.0
1145*4882a593Smuzhiyun  TcpInSegs                       1                  0.0
1146*4882a593Smuzhiyun  TcpOutSegs                      1                  0.0
1147*4882a593Smuzhiyun  TcpExtTCPPureAcks               1                  0.0
1148*4882a593Smuzhiyun  TcpExtTCPOrigDataSent           1                  0.0
1149*4882a593Smuzhiyun  IpExtInOctets                   52                 0.0
1150*4882a593Smuzhiyun  IpExtOutOctets                  58                 0.0
1151*4882a593Smuzhiyun  IpExtInNoECTPkts                1                  0.0
1152*4882a593Smuzhiyun
1153*4882a593SmuzhiyunThe server side nstat output::
1154*4882a593Smuzhiyun
1155*4882a593Smuzhiyun  nstatuser@nstat-b:~$ nstat
1156*4882a593Smuzhiyun  #kernel
1157*4882a593Smuzhiyun  IpInReceives                    1                  0.0
1158*4882a593Smuzhiyun  IpInDelivers                    1                  0.0
1159*4882a593Smuzhiyun  IpOutRequests                   1                  0.0
1160*4882a593Smuzhiyun  TcpInSegs                       1                  0.0
1161*4882a593Smuzhiyun  TcpOutSegs                      1                  0.0
1162*4882a593Smuzhiyun  IpExtInOctets                   58                 0.0
1163*4882a593Smuzhiyun  IpExtOutOctets                  52                 0.0
1164*4882a593Smuzhiyun  IpExtInNoECTPkts                1                  0.0
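
The IpExtInOctets and IpExtOutOctets values in these outputs can be
reproduced by hand. The sketch below assumes a 20-byte IPv4 header, a
20-byte TCP header, and 12 bytes of TCP options for the timestamp option
with its padding (that is, it assumes TCP timestamps are enabled). It also
assumes nc sent 'hello' followed by a newline. The helper name
``tcp_ip_octets`` is hypothetical.

```python
IPV4_HEADER_LEN = 20           # assumption: no IP options
TCP_HEADER_LEN = 20            # TCP header without options
TCP_TIMESTAMP_OPTION_LEN = 12  # 10-byte timestamp option plus 2 bytes of padding


def tcp_ip_octets(payload):
    """Return the IP-layer length of one TCP segment carrying `payload`."""
    headers = IPV4_HEADER_LEN + TCP_HEADER_LEN + TCP_TIMESTAMP_OPTION_LEN
    return headers + len(payload)


data_packet = tcp_ip_octets(b"hello\n")  # client to server
pure_ack = tcp_ip_octets(b"")            # server to client
print(data_packet, pure_ack)
```

Under these assumptions the data packet is 58 bytes and the ACK is 52
bytes, which is the pair of values seen on both hosts with the directions
swapped.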
1165*4882a593Smuzhiyun
Input a string in the nc client again ('world' in our example)::
1167*4882a593Smuzhiyun
1168*4882a593Smuzhiyun  nstatuser@nstat-a:~$ nc -v nstat-b 9000
1169*4882a593Smuzhiyun  Connection to nstat-b 9000 port [tcp/*] succeeded!
1170*4882a593Smuzhiyun  hello
1171*4882a593Smuzhiyun  world
1172*4882a593Smuzhiyun
1173*4882a593SmuzhiyunClient side nstat output::
1174*4882a593Smuzhiyun
1175*4882a593Smuzhiyun  nstatuser@nstat-a:~$ nstat
1176*4882a593Smuzhiyun  #kernel
1177*4882a593Smuzhiyun  IpInReceives                    1                  0.0
1178*4882a593Smuzhiyun  IpInDelivers                    1                  0.0
1179*4882a593Smuzhiyun  IpOutRequests                   1                  0.0
1180*4882a593Smuzhiyun  TcpInSegs                       1                  0.0
1181*4882a593Smuzhiyun  TcpOutSegs                      1                  0.0
1182*4882a593Smuzhiyun  TcpExtTCPHPAcks                 1                  0.0
1183*4882a593Smuzhiyun  TcpExtTCPOrigDataSent           1                  0.0
1184*4882a593Smuzhiyun  IpExtInOctets                   52                 0.0
1185*4882a593Smuzhiyun  IpExtOutOctets                  58                 0.0
1186*4882a593Smuzhiyun  IpExtInNoECTPkts                1                  0.0
1187*4882a593Smuzhiyun
1188*4882a593Smuzhiyun
1189*4882a593SmuzhiyunServer side nstat output::
1190*4882a593Smuzhiyun
1191*4882a593Smuzhiyun  nstatuser@nstat-b:~$ nstat
1192*4882a593Smuzhiyun  #kernel
1193*4882a593Smuzhiyun  IpInReceives                    1                  0.0
1194*4882a593Smuzhiyun  IpInDelivers                    1                  0.0
1195*4882a593Smuzhiyun  IpOutRequests                   1                  0.0
1196*4882a593Smuzhiyun  TcpInSegs                       1                  0.0
1197*4882a593Smuzhiyun  TcpOutSegs                      1                  0.0
1198*4882a593Smuzhiyun  TcpExtTCPHPHits                 1                  0.0
1199*4882a593Smuzhiyun  IpExtInOctets                   58                 0.0
1200*4882a593Smuzhiyun  IpExtOutOctets                  52                 0.0
1201*4882a593Smuzhiyun  IpExtInNoECTPkts                1                  0.0
1202*4882a593Smuzhiyun
Comparing the first and second client-side nstat outputs, we find one
difference: the first one has TcpExtTCPPureAcks, but the second one has
TcpExtTCPHPAcks. The first and second server-side nstat outputs differ
too: the second one has TcpExtTCPHPHits, but the first one doesn't. The
network traffic patterns were exactly the same: the client sent a packet
to the server and the server replied with an ACK. But the kernel handled
them in different ways. When the TCP window scale option is not used, the
kernel tries to enable the fast path immediately when the connection
enters the established state. If the TCP window scale option is used, the
kernel disables the fast path at first and tries to enable it after the
kernel receives packets. We can use the 'ss' command to verify whether
the window scale option is used, e.g. run the command below on either the
server or the client::
1218*4882a593Smuzhiyun
  nstatuser@nstat-a:~$ ss -o state established -i '( dport = :9000 or sport = :9000 )'
  Netid    Recv-Q     Send-Q            Local Address:Port             Peer Address:Port
  tcp      0          0               192.168.122.250:40654         192.168.122.251:9000
             ts sack cubic wscale:7,7 rto:204 rtt:0.98/0.49 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:10 bytes_acked:1 segs_out:2 segs_in:1 send 118.2Mbps lastsnd:46572 lastrcv:46572 lastack:46572 pacing_rate 236.4Mbps rcv_space:29200 rcv_ssthresh:29200 minrtt:0.98
1223*4882a593Smuzhiyun
The 'wscale:7,7' means both the server and the client set the window
scale option to 7. Now we can explain the nstat output in our test:

In the first client-side nstat output, the client sent a packet and the
server replied with an ACK. When the kernel handled this ACK, the fast
path was not enabled, so the ACK was counted in TcpExtTCPPureAcks.

In the second client-side nstat output, the client sent a packet again
and received another ACK from the server. This time the fast path was
enabled and the ACK qualified for it, so the ACK was handled by the fast
path and counted in TcpExtTCPHPAcks.

In the first server-side nstat output, the fast path was not enabled, so
there was no TcpExtTCPHPHits.

In the second server-side nstat output, the fast path was enabled and the
packet received from the client qualified for it, so the packet was
counted in TcpExtTCPHPHits.
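
Comparing two nstat outputs by eye is error-prone, so the comparison can
be scripted. The sketch below parses the three-column nstat text format
shown in this document and reports the counters that appear in only one
of two outputs. The helper names ``parse_nstat`` and ``only_in_one`` are
hypothetical, and the sample text is abridged from the client-side
outputs above.

```python
def parse_nstat(text):
    """Parse nstat output into {counter_name: integer_value}."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the '#kernel' header, shell prompts and blank lines.
        if len(fields) != 3 or line.lstrip().startswith("#"):
            continue
        name, value, _rate = fields
        if value.isdigit():
            counters[name] = int(value)
    return counters


def only_in_one(first, second):
    """Return the counter names present in exactly one of two outputs."""
    return sorted(set(first) ^ set(second))


first_client = parse_nstat("""
#kernel
TcpInSegs                       1                  0.0
TcpOutSegs                      1                  0.0
TcpExtTCPPureAcks               1                  0.0
TcpExtTCPOrigDataSent           1                  0.0
""")
second_client = parse_nstat("""
#kernel
TcpInSegs                       1                  0.0
TcpOutSegs                      1                  0.0
TcpExtTCPHPAcks                 1                  0.0
TcpExtTCPOrigDataSent           1                  0.0
""")
print(only_in_one(first_client, second_client))
```

For the two client-side samples, the only names reported are
TcpExtTCPHPAcks and TcpExtTCPPureAcks.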
1242*4882a593Smuzhiyun
1243*4882a593SmuzhiyunTcpExtTCPAbortOnClose
1244*4882a593Smuzhiyun---------------------
1245*4882a593SmuzhiyunOn the server side, we run below python script::
1246*4882a593Smuzhiyun
1247*4882a593Smuzhiyun  import socket
1248*4882a593Smuzhiyun  import time
1249*4882a593Smuzhiyun
1250*4882a593Smuzhiyun  port = 9000
1251*4882a593Smuzhiyun
1252*4882a593Smuzhiyun  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
1253*4882a593Smuzhiyun  s.bind(('0.0.0.0', port))
1254*4882a593Smuzhiyun  s.listen(1)
1255*4882a593Smuzhiyun  sock, addr = s.accept()
1256*4882a593Smuzhiyun  while True:
1257*4882a593Smuzhiyun      time.sleep(9999999)
1258*4882a593Smuzhiyun
This Python script listens on port 9000, but doesn't read anything from
the connection.
1261*4882a593Smuzhiyun
1262*4882a593SmuzhiyunOn the client side, we send the string "hello" by nc::
1263*4882a593Smuzhiyun
1264*4882a593Smuzhiyun  nstatuser@nstat-a:~$ echo "hello" | nc nstat-b 9000
1265*4882a593Smuzhiyun
Back on the server side, the server has received the "hello" packet and
the TCP layer has acked it, but the application hasn't read it yet. We
type Ctrl-C to terminate the server script. Then we find that
TcpExtTCPAbortOnClose increased by 1 on the server side::
1270*4882a593Smuzhiyun
1271*4882a593Smuzhiyun  nstatuser@nstat-b:~$ nstat | grep -i abort
1272*4882a593Smuzhiyun  TcpExtTCPAbortOnClose           1                  0.0
1273*4882a593Smuzhiyun
If we run tcpdump on the server side, we find that the server sent a
RST after we typed Ctrl-C.
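
The same behavior can be observed from the peer's point of view on a
single machine. The sketch below is an illustration over the loopback
interface, assuming a Linux TCP stack: the server side closes a
connection while data is still unread, and the client then reports what
it sees. The function name ``abort_on_close_demo`` is hypothetical, and
the kernel picks the port number.

```python
import select
import socket


def abort_on_close_demo():
    """Close a socket with unread data and report what the peer sees."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(1)

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(5.0)
    client.connect(listener.getsockname())
    conn, _addr = listener.accept()

    client.sendall(b"hello\n")
    # Wait until the data is queued on the server side, but never read it.
    select.select([conn], [], [], 5.0)
    conn.close()  # unread data is pending, so a RST is expected

    try:
        result = "eof" if client.recv(16) == b"" else "data"
    except ConnectionResetError:
        result = "reset"
    except socket.timeout:
        result = "timeout"
    finally:
        client.close()
        listener.close()
    return result


print(abort_on_close_demo())
```

On Linux the client is expected to see a connection reset rather than a
normal end of stream, and TcpExtTCPAbortOnClose should increase by 1 on
the host running the script.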
1276*4882a593Smuzhiyun
1277*4882a593SmuzhiyunTcpExtTCPAbortOnMemory and TcpExtTCPAbortOnTimeout
1278*4882a593Smuzhiyun---------------------------------------------------
Below is an example which makes the orphan socket count exceed
net.ipv4.tcp_max_orphans.

Change tcp_max_orphans to a smaller value on the client::
1282*4882a593Smuzhiyun
1283*4882a593Smuzhiyun  sudo bash -c "echo 10 > /proc/sys/net/ipv4/tcp_max_orphans"
1284*4882a593Smuzhiyun
Client code (create 64 connections to the server)::

  nstatuser@nstat-a:~$ cat client_orphan.py
  import socket
  import time

  server = 'nstat-b' # server address
  port = 9000

  count = 64

  connection_list = []

  # Open `count` connections and keep them referenced so they stay open.
  for i in range(count):
      s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      s.connect((server, port))
      connection_list.append(s)
      print("connection_count: %d" % len(connection_list))

  # Keep the process alive until it is interrupted with Ctrl-C.
  while True:
      time.sleep(99999)
1306*4882a593Smuzhiyun
Server code (accept 64 connections from the client)::
1308*4882a593Smuzhiyun
1309*4882a593Smuzhiyun  nstatuser@nstat-b:~$ cat server_orphan.py
1310*4882a593Smuzhiyun  import socket
1311*4882a593Smuzhiyun  import time
1312*4882a593Smuzhiyun
1313*4882a593Smuzhiyun  port = 9000
1314*4882a593Smuzhiyun  count = 64
1315*4882a593Smuzhiyun
1316*4882a593Smuzhiyun  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
1317*4882a593Smuzhiyun  s.bind(('0.0.0.0', port))
1318*4882a593Smuzhiyun  s.listen(count)
1319*4882a593Smuzhiyun  connection_list = []
1320*4882a593Smuzhiyun  while True:
1321*4882a593Smuzhiyun      sock, addr = s.accept()
1322*4882a593Smuzhiyun      connection_list.append((sock, addr))
1323*4882a593Smuzhiyun      print("connection_count: %d" % len(connection_list))
1324*4882a593Smuzhiyun
1325*4882a593SmuzhiyunRun the python scripts on server and client.
1326*4882a593Smuzhiyun
1327*4882a593SmuzhiyunOn server::
1328*4882a593Smuzhiyun
1329*4882a593Smuzhiyun  python3 server_orphan.py
1330*4882a593Smuzhiyun
1331*4882a593SmuzhiyunOn client::
1332*4882a593Smuzhiyun
1333*4882a593Smuzhiyun  python3 client_orphan.py
1334*4882a593Smuzhiyun
1335*4882a593SmuzhiyunRun iptables on server::
1336*4882a593Smuzhiyun
1337*4882a593Smuzhiyun  sudo iptables -A INPUT -i ens3 -p tcp --destination-port 9000 -j DROP
1338*4882a593Smuzhiyun
Type Ctrl-C on the client to stop client_orphan.py.
1340*4882a593Smuzhiyun
1341*4882a593SmuzhiyunCheck TcpExtTCPAbortOnMemory on client::
1342*4882a593Smuzhiyun
1343*4882a593Smuzhiyun  nstatuser@nstat-a:~$ nstat | grep -i abort
1344*4882a593Smuzhiyun  TcpExtTCPAbortOnMemory          54                 0.0
1345*4882a593Smuzhiyun
Check the orphan socket count on the client::
1347*4882a593Smuzhiyun
1348*4882a593Smuzhiyun  nstatuser@nstat-a:~$ ss -s
1349*4882a593Smuzhiyun  Total: 131 (kernel 0)
1350*4882a593Smuzhiyun  TCP:   14 (estab 1, closed 0, orphaned 10, synrecv 0, timewait 0/0), ports 0
1351*4882a593Smuzhiyun
1352*4882a593Smuzhiyun  Transport Total     IP        IPv6
1353*4882a593Smuzhiyun  *         0         -         -
1354*4882a593Smuzhiyun  RAW       1         0         1
1355*4882a593Smuzhiyun  UDP       1         1         0
1356*4882a593Smuzhiyun  TCP       14        13        1
1357*4882a593Smuzhiyun  INET      16        14        2
1358*4882a593Smuzhiyun  FRAG      0         0         0
1359*4882a593Smuzhiyun
The explanation of the test: after running server_orphan.py and
client_orphan.py, we have 64 connections between the server and the
client. After the iptables command is run, the server drops all packets
from the client. When we type Ctrl-C on client_orphan.py, the client
system tries to close these connections, and before they are closed
gracefully, they become orphan sockets. As the iptables rule on the
server blocks packets from the client, the server never receives the FIN
from the client, so all connections on the client are stuck in the
FIN_WAIT_1 state and remain orphan sockets until they time out. We wrote
10 to /proc/sys/net/ipv4/tcp_max_orphans, so the client system keeps only
10 orphan sockets; for all other orphan sockets, the client system sends
a RST and deletes them. We had 64 connections, so the 'ss -s' command
shows that the system has 10 orphan sockets, and the value of
TcpExtTCPAbortOnMemory is 54.
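
The split between kept and aborted sockets can be expressed as a small
model. This sketch is idealized: it assumes every close is checked
against an exact orphan count, which the kernel does not always do. The
function name ``close_orphans`` is hypothetical.

```python
def close_orphans(connections, max_orphans):
    """Return (kept_orphans, aborted) after closing all connections."""
    kept = 0
    aborted = 0
    for _ in range(connections):
        if kept < max_orphans:
            kept += 1     # stays in FIN_WAIT_1 as an orphan socket
        else:
            aborted += 1  # RST sent, counted in TcpExtTCPAbortOnMemory
    return kept, aborted


print(close_orphans(connections=64, max_orphans=10))
```

With 64 connections and a limit of 10, the model keeps 10 orphans and
aborts 54, matching the 'ss -s' and nstat outputs of this test.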
1374*4882a593Smuzhiyun
An additional explanation about the orphan socket count: you can find
the exact orphan socket count with the 'ss -s' command, but when the
kernel decides whether to increase TcpExtTCPAbortOnMemory and send a RST,
it doesn't always check the exact orphan socket count. To improve
performance, the kernel checks an approximate count first; only if the
approximate count is more than tcp_max_orphans does the kernel check the
exact count. So if the approximate count is less than tcp_max_orphans but
the exact count is more than tcp_max_orphans, you will find that
TcpExtTCPAbortOnMemory is not increased at all. If tcp_max_orphans is
large enough, this won't occur, but if you decrease tcp_max_orphans to a
small value as in our test, you might hit this issue. That is why the
client set up 64 connections in our test although tcp_max_orphans was 10.
If the client had set up only 11 connections, we wouldn't see any change
in TcpExtTCPAbortOnMemory.
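
The approximate-then-exact check can be illustrated with a batched
per-CPU counter. The following is a simplified model, not the kernel's
implementation: the class name ``ApproxCounter``, the number of CPUs and
the batch size of 32 are all assumptions chosen for illustration.

```python
class ApproxCounter:
    """A shared total plus per-CPU deltas that are folded in batches."""

    def __init__(self, cpus, batch):
        self.total = 0           # shared value, cheap to read
        self.local = [0] * cpus  # per-CPU deltas not yet folded in
        self.batch = batch

    def add(self, cpu, amount=1):
        self.local[cpu] += amount
        # Fold the local delta into the shared total once it is big enough.
        if abs(self.local[cpu]) >= self.batch:
            self.total += self.local[cpu]
            self.local[cpu] = 0

    def read_approx(self):
        return max(self.total, 0)

    def read_exact(self):
        return max(self.total + sum(self.local), 0)


def too_many_orphans(counter, max_orphans):
    """Check the cheap approximate value first, then the exact one."""
    if counter.read_approx() > max_orphans:
        return counter.read_exact() > max_orphans
    return False


few = ApproxCounter(cpus=4, batch=32)
for i in range(11):
    few.add(cpu=i % 4)
print(few.read_approx(), few.read_exact(), too_many_orphans(few, 10))

many = ApproxCounter(cpus=4, batch=32)
for _ in range(64):
    many.add(cpu=0)
print(many.read_approx(), many.read_exact(), too_many_orphans(many, 10))
```

In the first case the exact count (11) is above the limit, but the
approximate count is still 0, so the limit is not enforced. In the second
case enough increments have been folded into the shared total for the
check to trigger.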
1389*4882a593Smuzhiyun
Continuing the previous test, we wait for several minutes. Because the
iptables rule on the server blocks the traffic, the server never receives
the FIN, and all of the client's orphan sockets finally time out in the
FIN_WAIT_1 state. So after a few minutes, we find 10 timeouts on the
client::
1395*4882a593Smuzhiyun
1396*4882a593Smuzhiyun  nstatuser@nstat-a:~$ nstat | grep -i abort
1397*4882a593Smuzhiyun  TcpExtTCPAbortOnTimeout         10                 0.0
1398*4882a593Smuzhiyun
1399*4882a593SmuzhiyunTcpExtTCPAbortOnLinger
1400*4882a593Smuzhiyun----------------------
1401*4882a593SmuzhiyunThe server side code::
1402*4882a593Smuzhiyun
1403*4882a593Smuzhiyun  nstatuser@nstat-b:~$ cat server_linger.py
1404*4882a593Smuzhiyun  import socket
1405*4882a593Smuzhiyun  import time
1406*4882a593Smuzhiyun
1407*4882a593Smuzhiyun  port = 9000
1408*4882a593Smuzhiyun
1409*4882a593Smuzhiyun  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
1410*4882a593Smuzhiyun  s.bind(('0.0.0.0', port))
1411*4882a593Smuzhiyun  s.listen(1)
1412*4882a593Smuzhiyun  sock, addr = s.accept()
1413*4882a593Smuzhiyun  while True:
1414*4882a593Smuzhiyun      time.sleep(9999999)
1415*4882a593Smuzhiyun
1416*4882a593SmuzhiyunThe client side code::
1417*4882a593Smuzhiyun
1418*4882a593Smuzhiyun  nstatuser@nstat-a:~$ cat client_linger.py
1419*4882a593Smuzhiyun  import socket
1420*4882a593Smuzhiyun  import struct
1421*4882a593Smuzhiyun
1422*4882a593Smuzhiyun  server = 'nstat-b' # server address
1423*4882a593Smuzhiyun  port = 9000
1424*4882a593Smuzhiyun
1425*4882a593Smuzhiyun  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
1426*4882a593Smuzhiyun  s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 10))
1427*4882a593Smuzhiyun  s.setsockopt(socket.SOL_TCP, socket.TCP_LINGER2, struct.pack('i', -1))
1428*4882a593Smuzhiyun  s.connect((server, port))
1429*4882a593Smuzhiyun  s.close()
1430*4882a593Smuzhiyun
1431*4882a593SmuzhiyunRun server_linger.py on server::
1432*4882a593Smuzhiyun
1433*4882a593Smuzhiyun  nstatuser@nstat-b:~$ python3 server_linger.py
1434*4882a593Smuzhiyun
1435*4882a593SmuzhiyunRun client_linger.py on client::
1436*4882a593Smuzhiyun
1437*4882a593Smuzhiyun  nstatuser@nstat-a:~$ python3 client_linger.py
1438*4882a593Smuzhiyun
After running client_linger.py, check the output of nstat::
1440*4882a593Smuzhiyun
1441*4882a593Smuzhiyun  nstatuser@nstat-a:~$ nstat | grep -i abort
1442*4882a593Smuzhiyun  TcpExtTCPAbortOnLinger          1                  0.0
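
The two ``struct.pack`` calls in client_linger.py build the raw option
values passed to ``setsockopt``. The sketch below only decodes those
values to show what they contain. Reading SO_LINGER as a pair of C ints
(``l_onoff``, ``l_linger``) follows the usual ``struct linger`` layout.
Reading the negative TCP_LINGER2 value as the trigger for the abort in
FIN_WAIT_2 is an assumption made here, not something the test output
proves.

```python
import struct

# SO_LINGER takes a struct linger: two C ints, l_onoff and l_linger.
so_linger = struct.pack('ii', 1, 10)
l_onoff, l_linger = struct.unpack('ii', so_linger)
print("l_onoff=%d l_linger=%d seconds" % (l_onoff, l_linger))

# TCP_LINGER2 takes a single C int; client_linger.py passes -1.
tcp_linger2 = struct.pack('i', -1)
(linger2,) = struct.unpack('i', tcp_linger2)
print("TCP_LINGER2=%d" % linger2)
```

So the client enables lingering on close with a 10-second limit and sets
TCP_LINGER2 to a negative value before it connects and closes.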
1443*4882a593Smuzhiyun
1444*4882a593SmuzhiyunTcpExtTCPRcvCoalesce
1445*4882a593Smuzhiyun--------------------
On the server, we run a program which listens on TCP port 9000, but
doesn't read any data::
1448*4882a593Smuzhiyun
1449*4882a593Smuzhiyun  import socket
1450*4882a593Smuzhiyun  import time
1451*4882a593Smuzhiyun  port = 9000
1452*4882a593Smuzhiyun  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
1453*4882a593Smuzhiyun  s.bind(('0.0.0.0', port))
1454*4882a593Smuzhiyun  s.listen(1)
1455*4882a593Smuzhiyun  sock, addr = s.accept()
1456*4882a593Smuzhiyun  while True:
1457*4882a593Smuzhiyun      time.sleep(9999999)
1458*4882a593Smuzhiyun
1459*4882a593SmuzhiyunSave the above code as server_coalesce.py, and run::
1460*4882a593Smuzhiyun
1461*4882a593Smuzhiyun  python3 server_coalesce.py
1462*4882a593Smuzhiyun
1463*4882a593SmuzhiyunOn the client, save below code as client_coalesce.py::
1464*4882a593Smuzhiyun
1465*4882a593Smuzhiyun  import socket
1466*4882a593Smuzhiyun  server = 'nstat-b'
1467*4882a593Smuzhiyun  port = 9000
1468*4882a593Smuzhiyun  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
1469*4882a593Smuzhiyun  s.connect((server, port))
1470*4882a593Smuzhiyun
1471*4882a593SmuzhiyunRun::
1472*4882a593Smuzhiyun
1473*4882a593Smuzhiyun  nstatuser@nstat-a:~$ python3 -i client_coalesce.py
1474*4882a593Smuzhiyun
We use '-i' to enter the interactive mode, then send a packet::
1476*4882a593Smuzhiyun
1477*4882a593Smuzhiyun  >>> s.send(b'foo')
1478*4882a593Smuzhiyun  3
1479*4882a593Smuzhiyun
1480*4882a593SmuzhiyunSend a packet again::
1481*4882a593Smuzhiyun
1482*4882a593Smuzhiyun  >>> s.send(b'bar')
1483*4882a593Smuzhiyun  3
1484*4882a593Smuzhiyun
1485*4882a593SmuzhiyunOn the server, run nstat::
1486*4882a593Smuzhiyun
1487*4882a593Smuzhiyun  ubuntu@nstat-b:~$ nstat
1488*4882a593Smuzhiyun  #kernel
1489*4882a593Smuzhiyun  IpInReceives                    2                  0.0
1490*4882a593Smuzhiyun  IpInDelivers                    2                  0.0
1491*4882a593Smuzhiyun  IpOutRequests                   2                  0.0
1492*4882a593Smuzhiyun  TcpInSegs                       2                  0.0
1493*4882a593Smuzhiyun  TcpOutSegs                      2                  0.0
1494*4882a593Smuzhiyun  TcpExtTCPRcvCoalesce            1                  0.0
1495*4882a593Smuzhiyun  IpExtInOctets                   110                0.0
1496*4882a593Smuzhiyun  IpExtOutOctets                  104                0.0
1497*4882a593Smuzhiyun  IpExtInNoECTPkts                2                  0.0
1498*4882a593Smuzhiyun
The client sent two packets and the server didn't read any data. When
the second packet arrived at the server, the first packet was still in
the receive queue. So the TCP layer merged the two packets, and we
find that TcpExtTCPRcvCoalesce increased by 1.
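
The merge can be pictured with a toy receive queue. This is a simplified
model: the only condition it checks is that the new data directly follows
the data at the tail of the queue, while the kernel applies further
conditions. The class name ``ReceiveQueue`` is hypothetical.

```python
class ReceiveQueue:
    """Toy receive queue that merges contiguous segments at its tail."""

    def __init__(self):
        self.segments = []  # list of [start_seq, bytearray]
        self.coalesced = 0  # stands in for TcpExtTCPRcvCoalesce

    def enqueue(self, seq, data):
        """Queue a segment; return True if it was merged into the tail."""
        if self.segments:
            tail_seq, tail_data = self.segments[-1]
            if tail_seq + len(tail_data) == seq:
                tail_data.extend(data)
                self.coalesced += 1
                return True
        self.segments.append([seq, bytearray(data)])
        return False

    def read_all(self):
        """Simulate the application reading everything from the queue."""
        data = b"".join(bytes(d) for _, d in self.segments)
        self.segments.clear()
        return data


idle_reader = ReceiveQueue()
idle_reader.enqueue(0, b"foo")
idle_reader.enqueue(3, b"bar")
print(idle_reader.coalesced, len(idle_reader.segments))

busy_reader = ReceiveQueue()
busy_reader.enqueue(0, b"foo")
busy_reader.read_all()
busy_reader.enqueue(3, b"bar")
print(busy_reader.coalesced, len(busy_reader.segments))
```

When nothing is read between the two sends, the second segment is merged
and the counter becomes 1. If the application drains the queue first,
there is nothing to merge with and the counter stays at 0.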
1503*4882a593Smuzhiyun
1504*4882a593SmuzhiyunTcpExtListenOverflows and TcpExtListenDrops
1505*4882a593Smuzhiyun-------------------------------------------
1506*4882a593SmuzhiyunOn server, run the nc command, listen on port 9000::
1507*4882a593Smuzhiyun
1508*4882a593Smuzhiyun  nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
1509*4882a593Smuzhiyun  Listening on [0.0.0.0] (family 0, port 9000)
1510*4882a593Smuzhiyun
1511*4882a593SmuzhiyunOn client, run 3 nc commands in different terminals::
1512*4882a593Smuzhiyun
1513*4882a593Smuzhiyun  nstatuser@nstat-a:~$ nc -v nstat-b 9000
1514*4882a593Smuzhiyun  Connection to nstat-b 9000 port [tcp/*] succeeded!
1515*4882a593Smuzhiyun
The nc command accepts only 1 connection at a time, and it sets the
accept queue length to 1. In the current Linux implementation, setting
the queue length to n means the actual queue length is n+1. Now we have
created 3 connections: 1 is accepted by nc and 2 are in the accept
queue, so the accept queue is full.
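
The n+1 behavior follows if the queue counts as full only when the number
of queued connections is strictly greater than the configured length. The
sketch below models that rule under this assumption; the function names
are hypothetical.

```python
def accept_queue_is_full(queued, backlog):
    """The queue counts as full only once it holds more than `backlog`."""
    return queued > backlog


def offer_connections(attempts, backlog, app_accepts):
    """Return (accepted, queued, dropped) after `attempts` connections."""
    accepted = queued = dropped = 0
    for _ in range(attempts):
        if accept_queue_is_full(queued, backlog):
            dropped += 1  # SYN dropped: ListenOverflows and ListenDrops
            continue
        queued += 1
        if accepted < app_accepts:
            queued -= 1   # the application accepts it right away
            accepted += 1
    return accepted, queued, dropped


print(offer_connections(attempts=3, backlog=1, app_accepts=1))
print(offer_connections(attempts=4, backlog=1, app_accepts=1))
```

With a queue length of 1, three connections leave 1 accepted and 2
queued, and a fourth attempt is dropped.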
1520*4882a593Smuzhiyun
1521*4882a593SmuzhiyunBefore running the 4th nc, we clean the nstat history on the server::
1522*4882a593Smuzhiyun
1523*4882a593Smuzhiyun  nstatuser@nstat-b:~$ nstat -n
1524*4882a593Smuzhiyun
1525*4882a593SmuzhiyunRun the 4th nc on the client::
1526*4882a593Smuzhiyun
1527*4882a593Smuzhiyun  nstatuser@nstat-a:~$ nc -v nstat-b 9000
1528*4882a593Smuzhiyun
If the nc server is running on kernel 4.10 or a later version, you
won't see the "Connection to ... succeeded!" string, because the kernel
drops the SYN if the accept queue is full. If the nc server is running
on an older kernel, you would see that the connection succeeded,
because the kernel would complete the 3-way handshake and keep the socket
in the half-open queue. This test was done on kernel 4.15. Below is the
nstat output on the server::
1536*4882a593Smuzhiyun
1537*4882a593Smuzhiyun  nstatuser@nstat-b:~$ nstat
1538*4882a593Smuzhiyun  #kernel
1539*4882a593Smuzhiyun  IpInReceives                    4                  0.0
1540*4882a593Smuzhiyun  IpInDelivers                    4                  0.0
1541*4882a593Smuzhiyun  TcpInSegs                       4                  0.0
1542*4882a593Smuzhiyun  TcpExtListenOverflows           4                  0.0
1543*4882a593Smuzhiyun  TcpExtListenDrops               4                  0.0
1544*4882a593Smuzhiyun  IpExtInOctets                   240                0.0
1545*4882a593Smuzhiyun  IpExtInNoECTPkts                4                  0.0
1546*4882a593Smuzhiyun
Both TcpExtListenOverflows and TcpExtListenDrops were 4. If the time
between the 4th nc and the nstat had been longer, the values of
TcpExtListenOverflows and TcpExtListenDrops would have been larger,
because the SYN of the 4th nc was dropped and the client kept retrying.
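
How fast the two counters grow depends on the client's SYN retransmission
schedule. The sketch below assumes an initial timeout of 1 second that
doubles after every retry and a limit of 6 retries; these are common
Linux defaults, but they are assumptions here and may differ on a given
system. The function names are hypothetical.

```python
def syn_send_times(retries=6, initial_rto=1.0):
    """Return the times (in seconds) at which each SYN is sent."""
    times = [0.0]
    rto = initial_rto
    for _ in range(retries):
        times.append(times[-1] + rto)
        rto *= 2  # exponential backoff
    return times


def syns_sent_by(elapsed, retries=6, initial_rto=1.0):
    """Count the SYNs that have been sent `elapsed` seconds after the first."""
    return sum(1 for t in syn_send_times(retries, initial_rto) if t <= elapsed)


print(syn_send_times())
print(syns_sent_by(8))
```

Under these assumptions, running nstat about 8 seconds after starting the
4th nc would observe 4 dropped SYNs, which is consistent with the value
of 4 in the output above.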
1551*4882a593Smuzhiyun
1552*4882a593SmuzhiyunIpInAddrErrors, IpExtInNoRoutes and IpOutNoRoutes
1553*4882a593Smuzhiyun-------------------------------------------------
Server A IP address: 192.168.122.250

Server B IP address: 192.168.122.251

Prepare on server A, add a route for 8.8.8.8 via server B::
1557*4882a593Smuzhiyun
1558*4882a593Smuzhiyun  $ sudo ip route add 8.8.8.8/32 via 192.168.122.251
1559*4882a593Smuzhiyun
1560*4882a593SmuzhiyunPrepare on server B, disable send_redirects for all interfaces::
1561*4882a593Smuzhiyun
1562*4882a593Smuzhiyun  $ sudo sysctl -w net.ipv4.conf.all.send_redirects=0
1563*4882a593Smuzhiyun  $ sudo sysctl -w net.ipv4.conf.ens3.send_redirects=0
1564*4882a593Smuzhiyun  $ sudo sysctl -w net.ipv4.conf.lo.send_redirects=0
1565*4882a593Smuzhiyun  $ sudo sysctl -w net.ipv4.conf.default.send_redirects=0
1566*4882a593Smuzhiyun
We want server A to send a packet to 8.8.8.8 and route the packet
to server B. When server B receives such a packet, it might send an ICMP
Redirect message to server A; setting send_redirects to 0 disables
this behavior.

First, generate IpInAddrErrors. On server B, we disable IP forwarding::
1573*4882a593Smuzhiyun
1574*4882a593Smuzhiyun  $ sudo sysctl -w net.ipv4.conf.all.forwarding=0
1575*4882a593Smuzhiyun
1576*4882a593SmuzhiyunOn server A, we send packets to 8.8.8.8::
1577*4882a593Smuzhiyun
1578*4882a593Smuzhiyun  $ nc -v 8.8.8.8 53
1579*4882a593Smuzhiyun
1580*4882a593SmuzhiyunOn server B, we check the output of nstat::
1581*4882a593Smuzhiyun
1582*4882a593Smuzhiyun  $ nstat
1583*4882a593Smuzhiyun  #kernel
1584*4882a593Smuzhiyun  IpInReceives                    3                  0.0
1585*4882a593Smuzhiyun  IpInAddrErrors                  3                  0.0
1586*4882a593Smuzhiyun  IpExtInOctets                   180                0.0
1587*4882a593Smuzhiyun  IpExtInNoECTPkts                3                  0.0
1588*4882a593Smuzhiyun
As we made server A route 8.8.8.8 via server B and disabled IP
forwarding on server B, server A sent packets to server B, and server B
dropped them and increased IpInAddrErrors. As the nc command re-sends the
SYN packet if it doesn't receive a SYN+ACK, we find multiple
IpInAddrErrors.
1594*4882a593Smuzhiyun
1595*4882a593SmuzhiyunSecond, generate IpExtInNoRoutes. On server B, we enable IP
1596*4882a593Smuzhiyunforwarding::
1597*4882a593Smuzhiyun
1598*4882a593Smuzhiyun  $ sudo sysctl -w net.ipv4.conf.all.forwarding=1
1599*4882a593Smuzhiyun
1600*4882a593SmuzhiyunCheck the route table of server B and remove the default route::
1601*4882a593Smuzhiyun
1602*4882a593Smuzhiyun  $ ip route show
1603*4882a593Smuzhiyun  default via 192.168.122.1 dev ens3 proto static
1604*4882a593Smuzhiyun  192.168.122.0/24 dev ens3 proto kernel scope link src 192.168.122.251
1605*4882a593Smuzhiyun  $ sudo ip route delete default via 192.168.122.1 dev ens3 proto static
1606*4882a593Smuzhiyun
1607*4882a593SmuzhiyunOn server A, we contact 8.8.8.8 again::
1608*4882a593Smuzhiyun
1609*4882a593Smuzhiyun  $ nc -v 8.8.8.8 53
1610*4882a593Smuzhiyun  nc: connect to 8.8.8.8 port 53 (tcp) failed: Network is unreachable
1611*4882a593Smuzhiyun
1612*4882a593SmuzhiyunOn server B, run nstat::
1613*4882a593Smuzhiyun
1614*4882a593Smuzhiyun  $ nstat
1615*4882a593Smuzhiyun  #kernel
1616*4882a593Smuzhiyun  IpInReceives                    1                  0.0
1617*4882a593Smuzhiyun  IpOutRequests                   1                  0.0
1618*4882a593Smuzhiyun  IcmpOutMsgs                     1                  0.0
1619*4882a593Smuzhiyun  IcmpOutDestUnreachs             1                  0.0
1620*4882a593Smuzhiyun  IcmpMsgOutType3                 1                  0.0
1621*4882a593Smuzhiyun  IpExtInNoRoutes                 1                  0.0
1622*4882a593Smuzhiyun  IpExtInOctets                   60                 0.0
1623*4882a593Smuzhiyun  IpExtOutOctets                  88                 0.0
1624*4882a593Smuzhiyun  IpExtInNoECTPkts                1                  0.0
1625*4882a593Smuzhiyun
We enabled IP forwarding on server B, so when server B received a packet
whose destination IP address was 8.8.8.8, it tried to forward the
packet. We had deleted the default route, so there was no route for
8.8.8.8; server B increased IpExtInNoRoutes and sent an "ICMP
Destination Unreachable" message to server A.

Third, generate IpOutNoRoutes. Run the ping command on server B::

  $ ping -c 1 8.8.8.8
  connect: Network is unreachable

Run nstat on server B::

  $ nstat
  #kernel
  IpOutNoRoutes                   1                  0.0

We have deleted the default route on server B. Server B couldn't find
a route for the 8.8.8.8 IP address, so server B increased
IpOutNoRoutes.

TcpExtTCPACKSkippedSynRecv
--------------------------
In this test, we send three identical SYN packets from the client to the
server. The first SYN makes the server create a socket, set it to the
Syn-Recv state, and reply with a SYN/ACK. The second SYN makes the server
reply with the SYN/ACK again and record the reply time (the duplicate
ACK reply time). The third SYN makes the server check the previous
duplicate ACK reply time, decide to skip the duplicate ACK, and increase
the TcpExtTCPACKSkippedSynRecv counter.
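
The three-SYN sequence can be modelled with a small shell function. This
is a simplified sketch, not the kernel code: the function name, the
millisecond arrival times and the variable names are made up for
illustration, and the 500 ms limit is assumed to match the default of the
``net.ipv4.tcp_invalid_ratelimit`` sysctl.

```shell
# Simplified model of the rate limit applied to replies for duplicate SYNs.
# Arguments are arrival times in milliseconds (illustrative values only).
simulate_dup_syn() {
    ratelimit_ms=500        # assumed default of tcp_invalid_ratelimit
    last_reply_ms=-1        # -1: no duplicate reply has been recorded yet
    skipped=0
    first=1
    for t in "$@"; do
        if [ "$first" -eq 1 ]; then
            # First SYN: create the Syn-Recv socket and send the SYN/ACK.
            first=0
            echo "t=${t}ms: new SYN, reply SYN/ACK"
        elif [ "$last_reply_ms" -ge 0 ] &&
             [ $((t - last_reply_ms)) -lt "$ratelimit_ms" ]; then
            # A duplicate reply was sent too recently: skip it and count.
            skipped=$((skipped + 1))
            echo "t=${t}ms: duplicate SYN, reply skipped"
        else
            # Reply again and record the reply time.
            last_reply_ms=$t
            echo "t=${t}ms: duplicate SYN, reply SYN/ACK again"
        fi
    done
    echo "TcpExtTCPACKSkippedSynRecv=${skipped}"
}

simulate_dup_syn 0 10 20
```

With three SYNs arriving 10 ms apart, only the third one is skipped, so
the model ends with a count of 1, which matches the test below.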

Run tcpdump to capture a SYN packet::

  nstatuser@nstat-a:~$ sudo tcpdump -c 1 -w /tmp/syn.pcap port 9000
  tcpdump: listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes

Open another terminal and run the nc command::

  nstatuser@nstat-a:~$ nc nstat-b 9000

Because nstat-b was not listening on port 9000, it should have replied
with a RST, and the nc command exited immediately. That was enough for
the tcpdump command to capture a SYN packet. A Linux server might use
hardware offload for the TCP checksum, so the checksum in /tmp/syn.pcap
might not be correct. We call tcprewrite to fix it::

  nstatuser@nstat-a:~$ tcprewrite --infile=/tmp/syn.pcap --outfile=/tmp/syn_fixcsum.pcap --fixcsum

On nstat-b, we run nc to listen on port 9000::

  nstatuser@nstat-b:~$ nc -lkv 9000
  Listening on [0.0.0.0] (family 0, port 9000)

On nstat-a, we block packets from port 9000; otherwise nstat-a would send
a RST to nstat-b::

  nstatuser@nstat-a:~$ sudo iptables -A INPUT -p tcp --sport 9000 -j DROP

Send the SYN to nstat-b three times::

  nstatuser@nstat-a:~$ for i in {1..3}; do sudo tcpreplay -i ens3 /tmp/syn_fixcsum.pcap; done

Check the SNMP counter on nstat-b::

  nstatuser@nstat-b:~$ nstat | grep -i skip
  TcpExtTCPACKSkippedSynRecv      1                  0.0

As we expected, TcpExtTCPACKSkippedSynRecv is 1.

TcpExtTCPACKSkippedPAWS
-----------------------
To trigger PAWS, we could send an old SYN.

On nstat-b, let nc listen on port 9000::

  nstatuser@nstat-b:~$ nc -lkv 9000
  Listening on [0.0.0.0] (family 0, port 9000)

On nstat-a, run tcpdump to capture a SYN::

  nstatuser@nstat-a:~$ sudo tcpdump -w /tmp/paws_pre.pcap -c 1 port 9000
  tcpdump: listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes

On nstat-a, run nc as a client to connect to nstat-b::

  nstatuser@nstat-a:~$ nc -v nstat-b 9000
  Connection to nstat-b 9000 port [tcp/*] succeeded!

Now tcpdump has captured the SYN and exited. We should fix the
checksum::

  nstatuser@nstat-a:~$ tcprewrite --infile /tmp/paws_pre.pcap --outfile /tmp/paws.pcap --fixcsum

Send the SYN packet twice::

  nstatuser@nstat-a:~$ for i in {1..2}; do sudo tcpreplay -i ens3 /tmp/paws.pcap; done

On nstat-b, check the SNMP counter::

  nstatuser@nstat-b:~$ nstat | grep -i skip
  TcpExtTCPACKSkippedPAWS         1                  0.0

We sent two SYNs via tcpreplay, and both of them failed the PAWS
check. nstat-b replied with an ACK to the first SYN, skipped the ACK
for the second SYN, and updated TcpExtTCPACKSkippedPAWS.

TcpExtTCPACKSkippedSeq
----------------------
To trigger TcpExtTCPACKSkippedSeq, we send packets that have a valid
timestamp (to pass the PAWS check) but a sequence number that is out of
window. The Linux TCP stack does not skip the reply if the packet
carries data, so we need a pure ACK packet. To generate such a packet, we
could create two sockets: one on port 9000, another on port 9001. Then
we capture an ACK on port 9001 and change the source/destination port
numbers to match the port 9000 socket. Then we could trigger
TcpExtTCPACKSkippedSeq via this packet.
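
What "out of window" means can be illustrated with a short check. This is
only a sketch of the textbook rule for a zero-length segment and a
non-zero window (rcv_nxt <= seq < rcv_nxt + rcv_wnd, with 32-bit
wraparound); the kernel's real check differs in details. The function
name and the parameter names ``rcv_nxt`` and ``rcv_wnd`` are assumptions
made for the example.

```shell
# Return success if seq lies inside [rcv_nxt, rcv_nxt + rcv_wnd),
# treating sequence numbers as 32-bit values that wrap around.
seq_in_window() {
    seq=$1 rcv_nxt=$2 rcv_wnd=$3
    # Distance from rcv_nxt to seq, modulo 2^32.
    offset=$(( (seq - rcv_nxt) & 0xFFFFFFFF ))
    [ "$offset" -lt "$rcv_wnd" ]
}

# An ACK borrowed from another connection carries an unrelated sequence
# number, so it is very likely to land outside the window.
if seq_in_window 3000000000 1000 65535; then
    echo "in window"
else
    echo "out of window"
fi
```

The masking step is what makes the comparison safe when the window
straddles the point where sequence numbers wrap back to zero.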

On nstat-b, open two terminals and run two nc commands to listen on both
port 9000 and port 9001::

  nstatuser@nstat-b:~$ nc -lkv 9000
  Listening on [0.0.0.0] (family 0, port 9000)

  nstatuser@nstat-b:~$ nc -lkv 9001
  Listening on [0.0.0.0] (family 0, port 9001)

On nstat-a, run two nc clients::

  nstatuser@nstat-a:~$ nc -v nstat-b 9000
  Connection to nstat-b 9000 port [tcp/*] succeeded!

  nstatuser@nstat-a:~$ nc -v nstat-b 9001
  Connection to nstat-b 9001 port [tcp/*] succeeded!

On nstat-a, run tcpdump to capture an ACK::

  nstatuser@nstat-a:~$ sudo tcpdump -w /tmp/seq_pre.pcap -c 1 dst port 9001
  tcpdump: listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes

On nstat-b, send a packet via the port 9001 socket. For example, we sent
the string 'foo'::

  nstatuser@nstat-b:~$ nc -lkv 9001
  Listening on [0.0.0.0] (family 0, port 9001)
  Connection from nstat-a 42132 received!
  foo

On nstat-a, tcpdump should have captured the ACK. We should check
the source port numbers of the two nc clients::

  nstatuser@nstat-a:~$ ss -ta '( dport = :9000 || dport = :9001 )' | tee
  State  Recv-Q   Send-Q         Local Address:Port           Peer Address:Port
  ESTAB  0        0            192.168.122.250:50208       192.168.122.251:9000
  ESTAB  0        0            192.168.122.250:42132       192.168.122.251:9001

Run tcprewrite to change port 9001 to port 9000 and port 42132 to
port 50208::

  nstatuser@nstat-a:~$ tcprewrite --infile /tmp/seq_pre.pcap --outfile /tmp/seq.pcap -r 9001:9000 -r 42132:50208 --fixcsum
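
The source ports differ on every run, so the two ``-r`` arguments may be
easier to derive from the ss output than to type by hand. The following is
a rough sketch under a few assumptions: the helper name
``local_port_for`` and the ``ss_out`` variable are invented for this
example, the sample text is copied from the ss output above, and ports
are assumed to be printed numerically as they are there.

```shell
# Hypothetical helper: print the local (source) port of the established
# connection whose peer port is $1, reading ss-style output on stdin.
local_port_for() {
    awk -v dport="$1" '
        $1 == "ESTAB" {
            n = split($4, l, ":"); m = split($5, p, ":")
            if (p[m] == dport) print l[n]
        }'
}

# Sample lines copied from the ss output shown above.
ss_out='State  Recv-Q   Send-Q         Local Address:Port           Peer Address:Port
ESTAB  0        0            192.168.122.250:50208       192.168.122.251:9000
ESTAB  0        0            192.168.122.250:42132       192.168.122.251:9001'

new_sport=$(printf '%s\n' "$ss_out" | local_port_for 9000)
old_sport=$(printf '%s\n' "$ss_out" | local_port_for 9001)

# Print the port-map arguments in the form used by the tcprewrite command.
echo "-r 9001:9000 -r ${old_sport}:${new_sport}"
```

For the sample above this yields the same port mapping that is passed to
tcprewrite in the command line shown earlier.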

Now /tmp/seq.pcap is the packet we need. Send it to nstat-b::

  nstatuser@nstat-a:~$ for i in {1..2}; do sudo tcpreplay -i ens3 /tmp/seq.pcap; done

Check TcpExtTCPACKSkippedSeq on nstat-b::

  nstatuser@nstat-b:~$ nstat | grep -i skip
  TcpExtTCPACKSkippedSeq          1                  0.0