.. SPDX-License-Identifier: GPL-2.0

============
Timestamping
============


1. Control Interfaces
=====================

The interfaces for receiving network packet timestamps are:

SO_TIMESTAMP
  Generates a timestamp for each incoming packet in (not necessarily
  monotonic) system time. Reports the timestamp via recvmsg() in a
  control message in usec resolution.
  SO_TIMESTAMP is defined as SO_TIMESTAMP_NEW or SO_TIMESTAMP_OLD
  based on the architecture type and time_t representation of libc.
  Control message format is in struct __kernel_old_timeval for
  SO_TIMESTAMP_OLD and in struct __kernel_sock_timeval for
  SO_TIMESTAMP_NEW options respectively.

SO_TIMESTAMPNS
  Same timestamping mechanism as SO_TIMESTAMP, but reports the
  timestamp as struct timespec in nsec resolution.
  SO_TIMESTAMPNS is defined as SO_TIMESTAMPNS_NEW or SO_TIMESTAMPNS_OLD
  based on the architecture type and time_t representation of libc.
  Control message format is in struct timespec for SO_TIMESTAMPNS_OLD
  and in struct __kernel_timespec for SO_TIMESTAMPNS_NEW options
  respectively.

IP_MULTICAST_LOOP + SO_TIMESTAMP[NS]
  Only for multicast: approximate transmit timestamp obtained by
  reading the looped packet receive timestamp.

SO_TIMESTAMPING
  Generates timestamps on reception, transmission or both. Supports
  multiple timestamp sources, including hardware. Supports generating
  timestamps for stream sockets.


1.1 SO_TIMESTAMP (also SO_TIMESTAMP_OLD and SO_TIMESTAMP_NEW)
-------------------------------------------------------------

This socket option enables timestamping of datagrams on the reception
path. Because the destination socket, if any, is not known early in
the network stack, the feature has to be enabled for all packets. The
same is true for all early receive timestamp options.

For interface details, see `man 7 socket`.

Use SO_TIMESTAMP_NEW to always receive the timestamp in
struct __kernel_sock_timeval format.

SO_TIMESTAMP_OLD returns incorrect timestamps after the year 2038
on 32 bit machines.

1.2 SO_TIMESTAMPNS (also SO_TIMESTAMPNS_OLD and SO_TIMESTAMPNS_NEW)
-------------------------------------------------------------------

This option is identical to SO_TIMESTAMP except for the returned data type.
Its struct timespec allows for higher resolution (ns) timestamps than the
timeval of SO_TIMESTAMP (us).

Use SO_TIMESTAMPNS_NEW to always receive the timestamp in
struct __kernel_timespec format.

SO_TIMESTAMPNS_OLD returns incorrect timestamps after the year 2038
on 32 bit machines.
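
As a rough illustration of the receive-side options above, the following
sketch enables SO_TIMESTAMPNS and extracts the timestamp from the
ancillary data returned by recvmsg(). The helper names
enable_rx_timestampns() and rx_timestampns() are invented for this
example, and the code assumes that the libc struct timespec matches the
control message format selected by SO_TIMESTAMPNS::

  #define _GNU_SOURCE
  #include <string.h>
  #include <sys/socket.h>
  #include <time.h>

  /* Ask the kernel to attach a nanosecond receive timestamp to packets. */
  int enable_rx_timestampns(int fd)
  {
      int on = 1;

      return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPNS, &on, sizeof(on));
  }

  /*
   * Walk the control messages filled in by recvmsg() and copy out the
   * SCM_TIMESTAMPNS payload. Returns 0 on success, -1 if none is attached.
   */
  int rx_timestampns(struct msghdr *msg, struct timespec *ts)
  {
      struct cmsghdr *cmsg;

      for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) {
          if (cmsg->cmsg_level == SOL_SOCKET &&
              cmsg->cmsg_type == SCM_TIMESTAMPNS) {
              memcpy(ts, CMSG_DATA(cmsg), sizeof(*ts));
              return 0;
          }
      }
      return -1;
  }

A caller would pass a struct msghdr whose msg_control buffer holds at
least CMSG_SPACE(sizeof(struct timespec)) bytes to recvmsg() and then
hand the same msghdr to rx_timestampns().
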

1.3 SO_TIMESTAMPING (also SO_TIMESTAMPING_OLD and SO_TIMESTAMPING_NEW)
----------------------------------------------------------------------

Supports multiple types of timestamp requests. As a result, this
socket option takes a bitmap of flags, not a boolean. In::

  err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));

val is an integer with any of the following bits set. Setting any other
bit returns EINVAL and does not change the current state.

The socket option configures timestamp generation for individual
sk_buffs (1.3.1), timestamp reporting to the socket's error
queue (1.3.2) and options (1.3.3). Timestamp generation can also
be enabled for individual sendmsg calls using cmsg (1.3.4).


1.3.1 Timestamp Generation
^^^^^^^^^^^^^^^^^^^^^^^^^^

Some bits are requests to the stack to try to generate timestamps. Any
combination of them is valid. Changes to these bits apply to newly
created packets, not to packets already in the stack. As a result, it
is possible to selectively request timestamps for a subset of packets
(e.g., for sampling) by bracketing a send() call with two setsockopt
calls, one to enable timestamp generation and one to disable it.
Timestamps may also be generated for reasons other than being
requested by a particular socket, such as when receive timestamping is
enabled system wide, as explained earlier.

SOF_TIMESTAMPING_RX_HARDWARE:
  Request rx timestamps generated by the network adapter.

SOF_TIMESTAMPING_RX_SOFTWARE:
  Request rx timestamps when data enters the kernel. These timestamps
  are generated just after a device driver hands a packet to the
  kernel receive stack.

SOF_TIMESTAMPING_TX_HARDWARE:
  Request tx timestamps generated by the network adapter. This flag
  can be enabled via both socket options and control messages.

SOF_TIMESTAMPING_TX_SOFTWARE:
  Request tx timestamps when data leaves the kernel. These timestamps
  are generated in the device driver as close as possible, but always
  prior to, passing the packet to the network interface. Hence, they
  require driver support and may not be available for all devices.
  This flag can be enabled via both socket options and control messages.

SOF_TIMESTAMPING_TX_SCHED:
  Request tx timestamps prior to entering the packet scheduler. Kernel
  transmit latency is, if long, often dominated by queuing delay. The
  difference between this timestamp and one taken at
  SOF_TIMESTAMPING_TX_SOFTWARE will expose this latency independent
  of protocol processing. The latency incurred in protocol
  processing, if any, can be computed by subtracting a userspace
  timestamp taken immediately before send() from this timestamp. On
  machines with virtual devices where a transmitted packet travels
  through multiple devices and, hence, multiple packet schedulers,
  a timestamp is generated at each layer. This allows for fine
  grained measurement of queuing delay. This flag can be enabled
  via both socket options and control messages.

SOF_TIMESTAMPING_TX_ACK:
  Request tx timestamps when all data in the send buffer has been
  acknowledged. This only makes sense for reliable protocols. It is
  currently only implemented for TCP. For that protocol, it may
  over-report measurement, because the timestamp is generated when all
  data up to and including the buffer at send() was acknowledged: the
  cumulative acknowledgment. The mechanism ignores SACK and FACK.
  This flag can be enabled via both socket options and control messages.
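
The latency breakdown described for SOF_TIMESTAMPING_TX_SCHED comes down
to subtracting two timestamps. A minimal sketch follows; the helper name
tstamp_delta_ns() is an assumption of this example and not part of any
kernel or libc interface::

  #include <stdint.h>
  #include <time.h>

  /* Nanoseconds elapsed from 'earlier' to 'later' (negative if reversed). */
  int64_t tstamp_delta_ns(const struct timespec *earlier,
                          const struct timespec *later)
  {
      return ((int64_t)later->tv_sec - (int64_t)earlier->tv_sec) *
                 1000000000LL +
             ((int64_t)later->tv_nsec - (int64_t)earlier->tv_nsec);
  }

Given the SOF_TIMESTAMPING_TX_SCHED and SOF_TIMESTAMPING_TX_SOFTWARE
timestamps of the same packet, tstamp_delta_ns(&sched, &snd)
approximates the queuing delay. For the protocol processing estimate,
the userspace timestamp taken before send() should presumably come from
clock_gettime(CLOCK_REALTIME), so that it is on the same clock as the
software timestamps.
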


1.3.2 Timestamp Reporting
^^^^^^^^^^^^^^^^^^^^^^^^^

The other three bits control which timestamps will be reported in a
generated control message. Changes to the bits take immediate
effect at the timestamp reporting locations in the stack. Timestamps
are only reported for packets that also have the relevant timestamp
generation request set.

SOF_TIMESTAMPING_SOFTWARE:
  Report any software timestamps when available.

SOF_TIMESTAMPING_SYS_HARDWARE:
  This option is deprecated and ignored.

SOF_TIMESTAMPING_RAW_HARDWARE:
  Report hardware timestamps as generated by
  SOF_TIMESTAMPING_TX_HARDWARE when available.


1.3.3 Timestamp Options
^^^^^^^^^^^^^^^^^^^^^^^

The interface supports the options:

SOF_TIMESTAMPING_OPT_ID:
  Generate a unique identifier along with each packet. A process can
  have multiple concurrent timestamping requests outstanding. Packets
  can be reordered in the transmit path, for instance in the packet
  scheduler. In that case timestamps will be queued onto the error
  queue out of order from the original send() calls. It is then not
  always possible to uniquely match timestamps to the original send()
  calls based on timestamp order or payload inspection alone.

  This option associates each packet at send() with a unique
  identifier and returns that along with the timestamp. The identifier
  is derived from a per-socket u32 counter (that wraps). For datagram
  sockets, the counter increments with each sent packet. For stream
  sockets, it increments with every byte.

  The counter starts at zero. It is initialized the first time that
  the socket option is enabled. It is reset each time the option is
  enabled after having been disabled. Resetting the counter does not
  change the identifiers of existing packets in the system.

  This option is implemented only for transmit timestamps. There, the
  timestamp is always looped along with a struct sock_extended_err.
  The option modifies field ee_data to pass an id that is unique
  among all possibly concurrently outstanding timestamp requests for
  that socket.


SOF_TIMESTAMPING_OPT_CMSG:
  Support recv() cmsg for all timestamped packets. Control messages
  are already supported unconditionally on all packets with receive
  timestamps and on IPv6 packets with transmit timestamp. This option
  extends them to IPv4 packets with transmit timestamp. One use case
  is to correlate packets with their egress device, by enabling socket
  option IP_PKTINFO simultaneously.


SOF_TIMESTAMPING_OPT_TSONLY:
  Applies to transmit timestamps only. Makes the kernel return the
  timestamp as a cmsg alongside an empty packet, as opposed to
  alongside the original packet. This reduces the amount of memory
  charged to the socket's receive budget (SO_RCVBUF) and delivers
  the timestamp even if sysctl net.core.tstamp_allow_data is 0.
  This option disables SOF_TIMESTAMPING_OPT_CMSG.

SOF_TIMESTAMPING_OPT_STATS:
  Optional stats that are obtained along with the transmit timestamps.
  It must be used together with SOF_TIMESTAMPING_OPT_TSONLY. When the
  transmit timestamp is available, the stats are available in a
  separate control message of type SCM_TIMESTAMPING_OPT_STATS, as a
  list of TLVs (struct nlattr). These stats allow the
  application to associate various transport layer stats with
  the transmit timestamps, such as how long a certain block of
  data was limited by peer's receiver window.

SOF_TIMESTAMPING_OPT_PKTINFO:
  Enable the SCM_TIMESTAMPING_PKTINFO control message for incoming
  packets with hardware timestamps. The message contains struct
  scm_ts_pktinfo, which supplies the index of the real interface which
  received the packet and its length at layer 2. A valid (non-zero)
  interface index will be returned only if CONFIG_NET_RX_BUSY_POLL is
  enabled and the driver is using NAPI. The struct also contains two
  other fields, but they are reserved and undefined.

SOF_TIMESTAMPING_OPT_TX_SWHW:
  Request both hardware and software timestamps for outgoing packets
  when SOF_TIMESTAMPING_TX_HARDWARE and SOF_TIMESTAMPING_TX_SOFTWARE
  are enabled at the same time. If both timestamps are generated,
  two separate messages will be looped to the socket's error queue,
  each containing just one timestamp.

New applications are encouraged to pass SOF_TIMESTAMPING_OPT_ID to
disambiguate timestamps and SOF_TIMESTAMPING_OPT_TSONLY to operate
regardless of the setting of sysctl net.core.tstamp_allow_data.

An exception is when a process needs additional cmsg data, for
instance SOL_IP/IP_PKTINFO to detect the egress network interface.
Then pass option SOF_TIMESTAMPING_OPT_CMSG. This option depends on
having access to the contents of the original packet, so cannot be
combined with SOF_TIMESTAMPING_OPT_TSONLY.
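
The recommendation above could be wrapped as follows. This is only a
sketch: tx_tstamp_flags() and enable_tx_tstamp() are hypothetical helper
names, and the choice of software transmit timestamps
(SOF_TIMESTAMPING_TX_SOFTWARE with SOF_TIMESTAMPING_SOFTWARE) is an
assumption made to keep the example concrete::

  #include <sys/socket.h>
  #include <linux/net_tstamp.h>

  /*
   * Build the SO_TIMESTAMPING bitmap for software transmit timestamps.
   * need_cmsg selects SOF_TIMESTAMPING_OPT_CMSG instead of
   * SOF_TIMESTAMPING_OPT_TSONLY, because the two cannot be combined.
   */
  unsigned int tx_tstamp_flags(int need_cmsg)
  {
      unsigned int val = SOF_TIMESTAMPING_TX_SOFTWARE | /* generation */
                         SOF_TIMESTAMPING_SOFTWARE |    /* reporting */
                         SOF_TIMESTAMPING_OPT_ID;       /* matching */

      val |= need_cmsg ? SOF_TIMESTAMPING_OPT_CMSG
                       : SOF_TIMESTAMPING_OPT_TSONLY;
      return val;
  }

  /* Apply the bitmap to a socket; returns 0 or -1 with errno set. */
  int enable_tx_tstamp(int fd, int need_cmsg)
  {
      unsigned int val = tx_tstamp_flags(need_cmsg);

      return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));
  }

Keeping the bitmap construction separate from the setsockopt() call makes
the mutually exclusive options easy to check in isolation.
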


1.3.4 Enabling timestamps via control messages
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In addition to socket options, timestamp generation can be requested
per write via cmsg, only for SOF_TIMESTAMPING_TX_* (see Section 1.3.1).
Using this feature, applications can sample timestamps per sendmsg()
without paying the overhead of enabling and disabling timestamps via
setsockopt::

  struct msghdr *msg;
  struct cmsghdr *cmsg;
  ...
  cmsg                         = CMSG_FIRSTHDR(msg);
  cmsg->cmsg_level             = SOL_SOCKET;
  cmsg->cmsg_type              = SO_TIMESTAMPING;
  cmsg->cmsg_len               = CMSG_LEN(sizeof(__u32));
  *((__u32 *) CMSG_DATA(cmsg)) = SOF_TIMESTAMPING_TX_SCHED |
                                 SOF_TIMESTAMPING_TX_SOFTWARE |
                                 SOF_TIMESTAMPING_TX_ACK;
  err = sendmsg(fd, msg, 0);

The SOF_TIMESTAMPING_TX_* flags set via cmsg will override
the SOF_TIMESTAMPING_TX_* flags set via setsockopt.

Moreover, applications must still enable timestamp reporting via
setsockopt to receive timestamps::

  __u32 val = SOF_TIMESTAMPING_SOFTWARE |
              SOF_TIMESTAMPING_OPT_ID /* or any other flag */;
  err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));


1.4 Bytestream Timestamps
-------------------------

The SO_TIMESTAMPING interface supports timestamping of bytes in a
bytestream. Each request is interpreted as a request for the time at
which the entire contents of the buffer have passed a timestamping
point. That is, for streams option SOF_TIMESTAMPING_TX_SOFTWARE will
record when all bytes have reached the device driver, regardless of how
many packets the data has been converted into.

In general, bytestreams have no natural delimiters and therefore
correlating a timestamp with data is non-trivial. A range of bytes
may be split across segments, and any segments may be merged (possibly
coalescing sections of previously segmented buffers associated with
independent send() calls). Segments can be reordered and the same
byte range can coexist in multiple segments for protocols that
implement retransmissions.

It is essential that all timestamps implement the same semantics,
regardless of these possible transformations, as otherwise they are
incomparable. Handling "rare" corner cases differently from the
simple case (a 1:1 mapping from buffer to skb) is insufficient
because performance debugging often needs to focus on such outliers.

In practice, timestamps can be correlated with segments of a
bytestream consistently, if both the semantics of the timestamp and the
timing of measurement are chosen correctly. This challenge is no
different from deciding on a strategy for IP fragmentation. There, the
definition is that only the first fragment is timestamped. For
bytestreams, we chose that a timestamp is generated only when all
bytes have passed a point. SOF_TIMESTAMPING_TX_ACK as defined is easy to
implement and reason about. An implementation that has to take into
account SACK would be more complex due to possible transmission holes
and out of order arrival.

On the host, TCP can also break the simple 1:1 mapping from buffer to
skbuff as a result of Nagle, cork, autocork, segmentation and GSO. The
implementation ensures correctness in all cases by tracking the
individual last byte passed to send(), even if it is no longer the
last byte after an skbuff extend or merge operation. It stores the
relevant sequence number in skb_shinfo(skb)->tskey. Because an skbuff
has only one such field, only one timestamp can be generated.

In rare cases, a timestamp request can be missed if two requests are
collapsed onto the same skb. A process can detect this situation by
enabling SOF_TIMESTAMPING_OPT_ID and comparing the byte offset at
send time with the value returned for each timestamp. It can prevent
the situation by always flushing the TCP stack in between requests,
for instance by enabling TCP_NODELAY and disabling TCP_CORK and
autocork.

These precautions ensure that the timestamp is generated only when all
bytes have passed a timestamp point, assuming that the network stack
itself does not reorder the segments. The stack indeed tries to avoid
reordering. The one exception is under administrator control: it is
possible to construct a packet scheduler configuration that delays
segments from the same stream differently. Such a setup would be
unusual.
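
The detection step mentioned above, comparing the byte offset at send
time with the identifier returned with each timestamp, might look like
the sketch below. It assumes that the identifier reported for a stream
socket is the zero-based offset of the last byte of the send() buffer,
counted from the point where SOF_TIMESTAMPING_OPT_ID was enabled, and
that it wraps at 32 bits. expected_stream_id() is a hypothetical helper::

  #include <stddef.h>
  #include <stdint.h>

  /*
   * Identifier expected for a send() of 'len' bytes (len > 0) on a
   * stream socket that has already written 'bytes_before' bytes since
   * SOF_TIMESTAMPING_OPT_ID was enabled. Assumption: the kernel reports
   * the offset of the last byte of the buffer, modulo 2^32.
   */
  uint32_t expected_stream_id(uint64_t bytes_before, size_t len)
  {
      return (uint32_t)(bytes_before + len - 1);
  }

The application keeps a running total of the bytes it has passed to
send() and compares expected_stream_id() for each request against the
ee_data value of the timestamps it reads back. An expected value that
never shows up suggests that the request was collapsed into a later one.
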


2. Data Interfaces
==================

Timestamps are read using the ancillary data feature of recvmsg().
See `man 3 cmsg` for details of this interface. The socket manual
page (`man 7 socket`) describes how timestamps generated with
SO_TIMESTAMP and SO_TIMESTAMPNS can be retrieved.


2.1 SCM_TIMESTAMPING records
----------------------------

These timestamps are returned in a control message with cmsg_level
SOL_SOCKET, cmsg_type SCM_TIMESTAMPING, and a payload of one of the
following types.

For SO_TIMESTAMPING_OLD::

	struct scm_timestamping {
		struct timespec ts[3];
	};

For SO_TIMESTAMPING_NEW::

	struct scm_timestamping64 {
		struct __kernel_timespec ts[3];
	};

Use SO_TIMESTAMPING_NEW to always receive the timestamp in
struct scm_timestamping64 format.

SO_TIMESTAMPING_OLD returns incorrect timestamps after the year 2038
on 32 bit machines.

The structure can return up to three timestamps. This is a legacy
feature. At least one field is non-zero at any time. Most timestamps
are passed in ts[0]. Hardware timestamps are passed in ts[2].

ts[1] used to hold hardware timestamps converted to system time.
Instead, expose the hardware clock device on the NIC directly as
a HW PTP clock source, to allow time conversion in userspace and
optionally synchronize system time with a userspace PTP stack such
as linuxptp. For the PTP clock API, see Documentation/driver-api/ptp.rst.

Note that if the SO_TIMESTAMP or SO_TIMESTAMPNS option is enabled
together with SO_TIMESTAMPING using SOF_TIMESTAMPING_SOFTWARE, a false
software timestamp will be generated in the recvmsg() call and passed
in ts[0] when a real software timestamp is missing. This also happens
for hardware transmit timestamps.
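
An application typically has to decide which of the three fields to use.
One possible policy, shown below as a sketch, prefers the hardware
timestamp in ts[2] and falls back to the software timestamp in ts[0].
The helper pick_tstamp() and the preference order are assumptions of
this example. struct scm_timestamping is taken from <linux/errqueue.h>,
which relies on <time.h> having been included first::

  #include <stddef.h>
  #include <time.h>
  #include <linux/errqueue.h>

  /*
   * Select a timestamp from an SCM_TIMESTAMPING payload. Returns ts[2]
   * (hardware) if it is non-zero, else ts[0] (software) if it is
   * non-zero, else NULL. *is_hw reports whether ts[2] was chosen.
   */
  const struct timespec *pick_tstamp(const struct scm_timestamping *tss,
                                     int *is_hw)
  {
      const struct timespec *hw = &tss->ts[2];
      const struct timespec *sw = &tss->ts[0];

      if (hw->tv_sec || hw->tv_nsec) {
          *is_hw = 1;
          return hw;
      }
      *is_hw = 0;
      return (sw->tv_sec || sw->tv_nsec) ? sw : NULL;
  }

Checking ts[2] first also sidesteps the false software timestamp that
can appear in ts[0] when SO_TIMESTAMP or SO_TIMESTAMPNS is enabled at the
same time.
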

2.1.1 Transmit timestamps with MSG_ERRQUEUE
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For transmit timestamps the outgoing packet is looped back to the
socket's error queue with the send timestamp(s) attached. A process
receives the timestamps by calling recvmsg() with flag MSG_ERRQUEUE
set and with a msg_control buffer sufficiently large to receive the
relevant metadata structures. The recvmsg call returns the original
outgoing data packet with two ancillary messages attached.

A message of cmsg_level SOL_IP(V6) and cmsg_type IP(V6)_RECVERR
embeds a struct sock_extended_err. This defines the error type. For
timestamps, the ee_errno field is ENOMSG. The other ancillary message
will have cmsg_level SOL_SOCKET and cmsg_type SCM_TIMESTAMPING. This
embeds the struct scm_timestamping.


2.1.1.2 Timestamp types
~~~~~~~~~~~~~~~~~~~~~~~

The semantics of the three struct timespec are defined by field
ee_info in the extended error structure. It contains a value of
type SCM_TSTAMP_* to define the actual timestamp passed in
scm_timestamping.

The SCM_TSTAMP_* types are 1:1 matches to the SOF_TIMESTAMPING_*
control fields discussed previously, with one exception. For legacy
reasons, SCM_TSTAMP_SND is equal to zero and can be set for both
SOF_TIMESTAMPING_TX_HARDWARE and SOF_TIMESTAMPING_TX_SOFTWARE. It
is the first if ts[2] is non-zero, the second otherwise, in which
case the timestamp is stored in ts[0].
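
The rule above can be expressed as a small lookup. The function name
tx_tstamp_kind() and the returned strings are made up for this sketch;
the SCM_TSTAMP_* constants and struct sock_extended_err come from
<linux/errqueue.h>::

  #include <time.h>
  #include <linux/errqueue.h>

  /*
   * Describe a transmit timestamp from the ee_info field of the looped
   * struct sock_extended_err and the accompanying scm_timestamping.
   */
  const char *tx_tstamp_kind(const struct sock_extended_err *serr,
                             const struct scm_timestamping *tss)
  {
      switch (serr->ee_info) {
      case SCM_TSTAMP_SND:
          /* Legacy value shared by both; ts[2] set means hardware. */
          if (tss->ts[2].tv_sec || tss->ts[2].tv_nsec)
              return "tx hardware";
          return "tx software";
      case SCM_TSTAMP_SCHED:
          return "tx sched";
      case SCM_TSTAMP_ACK:
          return "tx ack";
      default:
          return "unknown";
      }
  }
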


2.1.1.3 Fragmentation
~~~~~~~~~~~~~~~~~~~~~

Fragmentation of outgoing datagrams is rare, but is possible, e.g., by
explicitly disabling PMTU discovery. If an outgoing packet is fragmented,
then only the first fragment is timestamped and returned to the sending
socket.


2.1.1.4 Packet Payload
~~~~~~~~~~~~~~~~~~~~~~

The calling application is often not interested in receiving the whole
packet payload that it passed to the stack originally: the socket
error queue mechanism is just a method to piggyback the timestamp on.
In this case, the application can choose to read datagrams with a
smaller buffer, possibly even of length 0. The payload is truncated
accordingly. Until the process calls recvmsg() on the error queue,
however, the full packet is queued, taking up budget from SO_RCVBUF.


2.1.1.5 Blocking Read
~~~~~~~~~~~~~~~~~~~~~

Reading from the error queue is always a non-blocking operation. To
block waiting on a timestamp, use poll or select. poll() will return
POLLERR in pollfd.revents if any data is ready on the error queue.
There is no need to pass this flag in pollfd.events. This flag is
ignored on request. See also `man 2 poll`.
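
Putting the pieces of this section together, the sketch below waits for
POLLERR, reads one message from the error queue without a data buffer,
and extracts both ancillary messages. struct tx_report,
parse_tx_report() and read_tx_report() are hypothetical names, the
256-byte control buffer is an assumed size, and the check of ee_origin
against SO_EE_ORIGIN_TIMESTAMPING is an extra sanity test that is not
described above::

  #define _GNU_SOURCE
  #include <errno.h>
  #include <poll.h>
  #include <string.h>
  #include <time.h>
  #include <netinet/in.h>
  #include <sys/socket.h>
  #include <linux/errqueue.h>

  struct tx_report {
      struct scm_timestamping tss; /* timestamp(s) */
      unsigned int type;           /* SCM_TSTAMP_* taken from ee_info */
      unsigned int id;             /* ee_data, meaningful with OPT_ID */
  };

  /* Extract the two ancillary messages; returns 0 if both were found. */
  int parse_tx_report(struct msghdr *msg, struct tx_report *out)
  {
      struct cmsghdr *cm;
      int have_ts = 0, have_err = 0;

      for (cm = CMSG_FIRSTHDR(msg); cm; cm = CMSG_NXTHDR(msg, cm)) {
          if (cm->cmsg_level == SOL_SOCKET &&
              cm->cmsg_type == SCM_TIMESTAMPING) {
              memcpy(&out->tss, CMSG_DATA(cm), sizeof(out->tss));
              have_ts = 1;
          } else if ((cm->cmsg_level == SOL_IP &&
                      cm->cmsg_type == IP_RECVERR) ||
                     (cm->cmsg_level == SOL_IPV6 &&
                      cm->cmsg_type == IPV6_RECVERR)) {
              struct sock_extended_err serr;

              memcpy(&serr, CMSG_DATA(cm), sizeof(serr));
              if (serr.ee_errno != ENOMSG ||
                  serr.ee_origin != SO_EE_ORIGIN_TIMESTAMPING)
                  continue; /* a real socket error, not a timestamp */
              out->type = serr.ee_info;
              out->id = serr.ee_data;
              have_err = 1;
          }
      }
      return (have_ts && have_err) ? 0 : -1;
  }

  /* Wait up to timeout_ms for one transmit timestamp on fd. */
  int read_tx_report(int fd, int timeout_ms, struct tx_report *out)
  {
      union { char buf[256]; struct cmsghdr align; } ctrl;
      struct pollfd pfd = { .fd = fd, .events = 0 }; /* POLLERR implicit */
      struct msghdr msg;

      if (poll(&pfd, 1, timeout_ms) <= 0 || !(pfd.revents & POLLERR))
          return -1;
      memset(&msg, 0, sizeof(msg)); /* no iovec: payload is truncated */
      msg.msg_control = ctrl.buf;
      msg.msg_controllen = sizeof(ctrl.buf);
      if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
          return -1;
      return parse_tx_report(&msg, out);
  }

Separating the parsing from the system calls keeps the cmsg handling
testable without a live socket.
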


2.1.2 Receive timestamps
^^^^^^^^^^^^^^^^^^^^^^^^

On reception, there is no reason to read from the socket error queue.
The SCM_TIMESTAMPING ancillary data is sent along with the packet data
on a normal recvmsg(). Since this is not a socket error, it is not
accompanied by a message SOL_IP(V6)/IP(V6)_RECVERR. In this case,
the meaning of the three fields in struct scm_timestamping is
implicitly defined. ts[0] holds a software timestamp if set, ts[1]
is again deprecated and ts[2] holds a hardware timestamp if set.


3. Hardware Timestamping configuration: SIOCSHWTSTAMP and SIOCGHWTSTAMP
=======================================================================

Hardware time stamping must also be initialized for each device driver
that is expected to do hardware time stamping. The parameter is defined in
include/uapi/linux/net_tstamp.h as::

	struct hwtstamp_config {
		int flags;	/* no flags defined right now, must be zero */
		int tx_type;	/* HWTSTAMP_TX_* */
		int rx_filter;	/* HWTSTAMP_FILTER_* */
	};

Desired behavior is passed into the kernel and to a specific device by
calling ioctl(SIOCSHWTSTAMP) with a pointer to a struct ifreq whose
ifr_data points to a struct hwtstamp_config. The tx_type and
rx_filter are hints to the driver about what it is expected to do. If
the requested fine-grained filtering for incoming packets is not
supported, the driver may time stamp more than just the requested types
of packets.

Drivers are free to use a more permissive configuration than the requested
configuration. It is expected that drivers should only implement directly the
most generic mode that can be supported. For example if the hardware can
support HWTSTAMP_FILTER_PTP_V2_EVENT, then it should generally always upscale
HWTSTAMP_FILTER_PTP_V2_L2_SYNC, and so forth, as HWTSTAMP_FILTER_PTP_V2_EVENT
is more generic (and more useful to applications).
491*4882a593Smuzhiyun
492*4882a593SmuzhiyunA driver which supports hardware time stamping shall update the struct
493*4882a593Smuzhiyunwith the actual, possibly more permissive configuration. If the
494*4882a593Smuzhiyunrequested packets cannot be time stamped, then nothing should be
495*4882a593Smuzhiyunchanged and ERANGE shall be returned (in contrast to EINVAL, which
496*4882a593Smuzhiyunindicates that SIOCSHWTSTAMP is not supported at all).

Only a process with admin rights may change the configuration. User
space is responsible for ensuring that multiple processes don't interfere
with each other and that the settings are reset.

Any process can read the actual configuration by passing this
structure to ioctl(SIOCGHWTSTAMP) in the same way.  However, this has
not been implemented in all drivers.
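
Reading the configuration back follows the same pattern as setting it. The
fragment below is a minimal sketch; ``hwtstamp_get()`` is a hypothetical
helper name, and callers should be prepared for the ioctl to fail on drivers
that do not implement it::

    #include <string.h>
    #include <sys/ioctl.h>
    #include <net/if.h>
    #include <linux/net_tstamp.h>
    #include <linux/sockios.h>

    /*
     * Hypothetical helper: fetch the current time stamping configuration
     * of ifname into *cfg. Returns 0 on success or -1 with errno set.
     */
    int hwtstamp_get(int fd, const char *ifname,
                     struct hwtstamp_config *cfg)
    {
        struct ifreq ifr;

        memset(cfg, 0, sizeof(*cfg));
        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
        ifr.ifr_data = (void *)cfg;

        return ioctl(fd, SIOCGHWTSTAMP, &ifr);
    }

On success, ``cfg->tx_type`` and ``cfg->rx_filter`` hold the values currently
in effect, which may be more permissive than what was last requested.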

::

    /* possible values for hwtstamp_config->tx_type */
    enum {
	    /*
	    * no outgoing packet will need hardware time stamping;
	    * should a packet arrive which asks for it, no hardware
	    * time stamping will be done
	    */
	    HWTSTAMP_TX_OFF,

	    /*
	    * enables hardware time stamping for outgoing packets;
	    * the sender of the packet decides which are to be
	    * time stamped by setting SOF_TIMESTAMPING_TX_SOFTWARE
	    * before sending the packet
	    */
	    HWTSTAMP_TX_ON,
    };

    /* possible values for hwtstamp_config->rx_filter */
    enum {
	    /* time stamp no incoming packet at all */
	    HWTSTAMP_FILTER_NONE,

	    /* time stamp any incoming packet */
	    HWTSTAMP_FILTER_ALL,

	    /* return value: time stamp all packets requested plus some others */
	    HWTSTAMP_FILTER_SOME,

	    /* PTP v1, UDP, any kind of event packet */
	    HWTSTAMP_FILTER_PTP_V1_L4_EVENT,

	    /* for the complete list of values, please check
	    * the include file include/uapi/linux/net_tstamp.h
	    */
    };

3.1 Hardware Timestamping Implementation: Device Drivers
--------------------------------------------------------

A driver which supports hardware time stamping must support the
SIOCSHWTSTAMP ioctl and update the supplied struct hwtstamp_config with
the actual values as described in the section on SIOCSHWTSTAMP.  It
should also support SIOCGHWTSTAMP.

Time stamps for received packets must be stored in the skb. To get a pointer
to the shared time stamp structure of the skb, call skb_hwtstamps(). Then
set the time stamps in the structure::

    struct skb_shared_hwtstamps {
	    /* hardware time stamp transformed into duration
	    * since arbitrary point in time
	    */
	    ktime_t	hwtstamp;
    };

Time stamps for outgoing packets are to be generated as follows:

- In hard_start_xmit(), check if (skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP)
  is non-zero. If yes, then the driver is expected to do hardware time
  stamping.
- If this is possible for the skb and requested, then declare
  that the driver is doing the time stamping by setting the flag
  SKBTX_IN_PROGRESS in skb_shinfo(skb)->tx_flags, e.g. with::

      skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;

  You might want to keep a pointer to the associated skb for the next step
  and not free the skb. A driver not supporting hardware time stamping doesn't
  do that. A driver must never touch sk_buff::tstamp! It is used to store
  software generated time stamps by the network subsystem.
- The driver should call skb_tx_timestamp() as close to passing the sk_buff to
  hardware as possible. skb_tx_timestamp() provides a software time stamp if
  requested and hardware timestamping is not possible (SKBTX_IN_PROGRESS not
  set).
- As soon as the driver has sent the packet and/or obtained a
  hardware time stamp for it, it passes the time stamp back by
  calling skb_tstamp_tx() with the original skb and the raw
  hardware time stamp. skb_tstamp_tx() clones the original skb and
  adds the timestamps, therefore the original skb has to be freed now.
  If obtaining the hardware time stamp somehow fails, then the driver
  should not fall back to software time stamping. The rationale is that
  this would occur at a later time in the processing pipeline than other
  software time stamping and therefore could lead to unexpected deltas
  between time stamps.

3.2 Special considerations for stacked PTP Hardware Clocks
----------------------------------------------------------

There are situations when there may be more than one PHC (PTP Hardware Clock)
in the data path of a packet. The kernel has no explicit mechanism to allow the
user to select which PHC to use for timestamping Ethernet frames. Instead, the
assumption is that the outermost PHC is always the most preferable, and that
kernel drivers collaborate towards achieving that goal. Currently there are 3
cases of stacked PHCs, detailed below:

3.2.1 DSA (Distributed Switch Architecture) switches
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

These are Ethernet switches which have one of their ports connected to an
(otherwise completely unaware) host Ethernet interface, and perform the role of
a port multiplier with optional forwarding acceleration features.  Each DSA
switch port is visible to the user as a standalone (virtual) network interface,
and its network I/O is performed, under the hood, indirectly through the host
interface (redirecting to the host port on TX, and intercepting frames on RX).

When a DSA switch is attached to a host port, PTP synchronization has to
suffer, since the switch's variable queuing delay introduces a path delay
jitter between the host port and its PTP partner. For this reason, some DSA
switches include a timestamping clock of their own, and have the ability to
perform network timestamping on their own MAC, such that path delays only
measure wire and PHY propagation latencies. Timestamping DSA switches are
supported in Linux and expose the same ABI as any other network interface
(although the DSA interfaces are in fact virtual in terms of network I/O, they
do have their own PHC).  It is typical, but not mandatory, for all
interfaces of a DSA switch to share the same PHC.

By design, PTP timestamping with a DSA switch does not need any special
handling in the driver for the host port it is attached to.  However, when the
host port also supports PTP timestamping, DSA will take care of intercepting
the ``.ndo_do_ioctl`` calls towards the host port, and block attempts to enable
hardware timestamping on it. This is because the SO_TIMESTAMPING API does not
allow the delivery of multiple hardware timestamps for the same packet, so
anybody other than the DSA switch port must be prevented from doing so.

DSA already provides most of the timestamping infrastructure in generic code:
a BPF classifier (``ptp_classify_raw``) is used to identify PTP event messages
(any other packets, including PTP general messages, are not timestamped), and
two hooks are provided to drivers:

- ``.port_txtstamp()``: The driver is passed a clone of the timestampable skb
  to be transmitted, before actually transmitting it. Typically, a switch will
  have a PTP TX timestamp register (or sometimes a FIFO) where the timestamp
  becomes available. There may be an IRQ that is raised upon this timestamp's
  availability, or the driver might have to poll after invoking
  ``dev_queue_xmit()`` towards the host interface. Either way, in the
  ``.port_txtstamp()`` method, the driver only needs to save the clone for
  later use (when the timestamp becomes available). Each skb is annotated with
  a pointer to its clone, in ``DSA_SKB_CB(skb)->clone``, to ease the driver's
  job of keeping track of which clone belongs to which skb.

- ``.port_rxtstamp()``: The original (and only) timestampable skb is provided
  to the driver, for it to annotate it with a timestamp, if that is immediately
  available, or defer to later. On reception, timestamps might either be
  available in-band (through metadata in the DSA header, or attached in other
  ways to the packet), or out-of-band (through another RX timestamping FIFO).
  Deferral on RX is typically necessary when retrieving the timestamp needs a
  sleepable context. In that case, it is the responsibility of the DSA driver
  to call ``netif_rx_ni()`` on the freshly timestamped skb.

3.2.2 Ethernet PHYs
^^^^^^^^^^^^^^^^^^^

These are devices that typically fulfill a Layer 1 role in the network stack,
hence they do not have a representation in terms of a network interface as DSA
switches do. However, PHYs may be able to detect and timestamp PTP packets, for
performance reasons: timestamps taken as close as possible to the wire have the
potential to yield a more stable and precise synchronization.

A PHY driver that supports PTP timestamping must create a ``struct
mii_timestamper`` and add a pointer to it in ``phydev->mii_ts``. The presence
of this pointer will be checked by the networking stack.

Since PHYs do not have network interface representations, the timestamping and
ethtool ioctl operations for them need to be mediated by their respective MAC
driver.  Therefore, as opposed to DSA switches, modifications need to be done
to each individual MAC driver for PHY timestamping support. This entails:

- Checking, in ``.ndo_do_ioctl``, whether ``phy_has_hwtstamp(netdev->phydev)``
  is true or not. If it is, then the MAC driver should not process this request
  but instead pass it on to the PHY using ``phy_mii_ioctl()``.

- On RX, special intervention may or may not be needed, depending on the
  function used to deliver skb's up the network stack. In the case of plain
  ``netif_rx()`` and similar, MAC drivers must check whether
  ``skb_defer_rx_timestamp(skb)`` is necessary or not - and if it is, don't
  call ``netif_rx()`` at all.  If ``CONFIG_NETWORK_PHY_TIMESTAMPING`` is
  enabled, and ``skb->dev->phydev->mii_ts`` exists, its ``.rxtstamp()`` hook
  will be called now, to determine, using logic very similar to DSA, whether
  deferral for RX timestamping is necessary.  Again like DSA, it becomes the
  responsibility of the PHY driver to send the packet up the stack when the
  timestamp is available.

  For other skb receive functions, such as ``napi_gro_receive`` and
  ``netif_receive_skb``, the stack automatically checks whether
  ``skb_defer_rx_timestamp()`` is necessary, so this check is not needed inside
  the driver.

- On TX, again, special intervention might or might not be needed.  The
  function that calls the ``mii_ts->txtstamp()`` hook is named
  ``skb_clone_tx_timestamp()``. This function can either be called directly
  (in which case explicit MAC driver support is indeed needed), or it can
  piggyback on the ``skb_tx_timestamp()`` call, which many MAC
  drivers already perform for software timestamping purposes. Therefore, if a
  MAC supports software timestamping, it does not need to do anything further
  at this stage.

3.2.3 MII bus snooping devices
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

These perform the same role as timestamping Ethernet PHYs, save for the fact
that they are discrete devices and can therefore be used in conjunction with
any PHY even if it doesn't support timestamping. In Linux, they are
discoverable and attachable to a ``struct phy_device`` through Device Tree, and
for the rest, they use the same mii_ts infrastructure as timestamping PHYs. See
Documentation/devicetree/bindings/ptp/timestamper.txt for more details.

3.2.4 Other caveats for MAC drivers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A caveat with stacked PHCs, especially (but not only) DSA, is that they
uncover bugs which were impossible to trigger before the existence of stacked
PTP clocks. DSA is particularly prone to this because it does not require any
modification to MAC drivers, which makes it more difficult to ensure
correctness of all possible code paths.  One example has to do with
this line of code, already presented earlier::

      skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;

Any TX timestamping logic, be it a plain MAC driver, a DSA switch driver, a PHY
driver or a MII bus snooping device driver, should set this flag.
But a MAC driver that is unaware of PHC stacking might get tripped up by
somebody other than itself setting this flag, and deliver a duplicate
timestamp.
For example, a typical driver design for TX timestamping might be to split the
transmission part into 2 portions:

1. "TX": checks whether PTP timestamping has been previously enabled through
   the ``.ndo_do_ioctl`` ("``priv->hwtstamp_tx_enabled == true``") and the
   current skb requires a TX timestamp ("``skb_shinfo(skb)->tx_flags &
   SKBTX_HW_TSTAMP``"). If this is true, it sets the
   "``skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS``" flag. Note: as
   described above, in the case of a stacked PHC system, this condition should
   never trigger, as this MAC is certainly not the outermost PHC. But this is
   not where the typical issue is.  Transmission proceeds with this packet.

2. "TX confirmation": Transmission has finished. The driver checks whether it
   is necessary to collect any TX timestamp for it. Here is where the typical
   issues are: the MAC driver takes a shortcut and only checks whether
   "``skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS``" was set. With a stacked
   PHC system, this is incorrect because this MAC driver is not the only entity
   in the TX data path that could have enabled SKBTX_IN_PROGRESS in the first
   place.

The correct solution for this problem is for MAC drivers to have a compound
check in their "TX confirmation" portion, not only for
"``skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS``", but also for
"``priv->hwtstamp_tx_enabled == true``". Because the rest of the system ensures
that PTP timestamping is not enabled for anything other than the outermost PHC,
this enhanced check will avoid delivering a duplicated TX timestamp to user
space.
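
The difference between the shortcut and the compound check can be modelled in
plain C outside the kernel. The sketch below is a simplified illustration and
not driver code: ``struct mac_priv``, ``mac_tx()`` and both ``mac_tx_confirm``
variants are hypothetical, and the ``SKBTX_*`` values are local stand-ins for
the definitions in include/linux/skbuff.h::

    #include <stdbool.h>
    #include <stdint.h>

    /* Local stand-ins for the kernel flag bits (values illustrative). */
    #define SKBTX_HW_TSTAMP     (1u << 0)
    #define SKBTX_IN_PROGRESS   (1u << 2)

    /* Hypothetical private data of a MAC driver. */
    struct mac_priv {
        bool hwtstamp_tx_enabled;   /* set by the SIOCSHWTSTAMP handler */
    };

    /*
     * "TX" portion: claim the packet only if this MAC was configured for
     * TX time stamping and the packet asks for a hardware time stamp.
     * Returns the possibly updated tx_flags.
     */
    uint8_t mac_tx(const struct mac_priv *priv, uint8_t tx_flags)
    {
        if (priv->hwtstamp_tx_enabled && (tx_flags & SKBTX_HW_TSTAMP))
            tx_flags |= SKBTX_IN_PROGRESS;
        return tx_flags;
    }

    /* "TX confirmation", shortcut variant: trusts the flag alone. */
    bool mac_tx_confirm_buggy(const struct mac_priv *priv, uint8_t tx_flags)
    {
        (void)priv;
        return (tx_flags & SKBTX_IN_PROGRESS) != 0;
    }

    /* "TX confirmation", compound check: flag set and this MAC enabled. */
    bool mac_tx_confirm(const struct mac_priv *priv, uint8_t tx_flags)
    {
        return priv->hwtstamp_tx_enabled &&
               (tx_flags & SKBTX_IN_PROGRESS) != 0;
    }

With ``hwtstamp_tx_enabled`` false and a ``tx_flags`` value in which another
timestamper has already set SKBTX_IN_PROGRESS, ``mac_tx_confirm_buggy()``
returns true, whereas ``mac_tx_confirm()`` returns false, so in this model the
host MAC would not deliver a duplicate timestamp.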