.. SPDX-License-Identifier: GPL-2.0

============
Timestamping
============


1. Control Interfaces
=====================

The interfaces for receiving network packet timestamps are:

SO_TIMESTAMP
  Generates a timestamp for each incoming packet in (not necessarily
  monotonic) system time. Reports the timestamp via recvmsg() in a
  control message in usec resolution.
  SO_TIMESTAMP is defined as SO_TIMESTAMP_NEW or SO_TIMESTAMP_OLD
  based on the architecture type and time_t representation of libc.
  Control message format is in struct __kernel_old_timeval for
  SO_TIMESTAMP_OLD and in struct __kernel_sock_timeval for
  SO_TIMESTAMP_NEW options respectively.

SO_TIMESTAMPNS
  Same timestamping mechanism as SO_TIMESTAMP, but reports the
  timestamp as struct timespec in nsec resolution.
  SO_TIMESTAMPNS is defined as SO_TIMESTAMPNS_NEW or SO_TIMESTAMPNS_OLD
  based on the architecture type and time_t representation of libc.
  Control message format is in struct timespec for SO_TIMESTAMPNS_OLD
  and in struct __kernel_timespec for SO_TIMESTAMPNS_NEW options
  respectively.

IP_MULTICAST_LOOP + SO_TIMESTAMP[NS]
  Only for multicast: approximate transmit timestamp obtained by
  reading the looped packet receive timestamp.

SO_TIMESTAMPING
  Generates timestamps on reception, transmission or both. Supports
  multiple timestamp sources, including hardware. Supports generating
  timestamps for stream sockets.


1.1 SO_TIMESTAMP (also SO_TIMESTAMP_OLD and SO_TIMESTAMP_NEW)
-------------------------------------------------------------

This socket option enables timestamping of datagrams on the reception
path. Because the destination socket, if any, is not known early in
the network stack, the feature has to be enabled for all packets. The
same is true for all early receive timestamp options.

For interface details, see `man 7 socket`.

Always use SO_TIMESTAMP_NEW to get the timestamp in
struct __kernel_sock_timeval format.

SO_TIMESTAMP_OLD returns incorrect timestamps after the year 2038
on 32 bit machines.

1.2 SO_TIMESTAMPNS (also SO_TIMESTAMPNS_OLD and SO_TIMESTAMPNS_NEW)
-------------------------------------------------------------------

This option is identical to SO_TIMESTAMP except for the returned data type.
Its struct timespec allows for higher resolution (ns) timestamps than the
timeval of SO_TIMESTAMP (us).

Always use SO_TIMESTAMPNS_NEW to get the timestamp in
struct __kernel_timespec format.

SO_TIMESTAMPNS_OLD returns incorrect timestamps after the year 2038
on 32 bit machines.

1.3 SO_TIMESTAMPING (also SO_TIMESTAMPING_OLD and SO_TIMESTAMPING_NEW)
----------------------------------------------------------------------

Supports multiple types of timestamp requests. As a result, this
socket option takes a bitmap of flags, not a boolean. In::

  err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));

val is an integer with any of the following bits set. Setting any other
bit returns EINVAL and does not change the current state.

The socket option configures timestamp generation for individual
sk_buffs (1.3.1), timestamp reporting to the socket's error
queue (1.3.2) and options (1.3.3). Timestamp generation can also
be enabled for individual sendmsg calls using cmsg (1.3.4).


1.3.1 Timestamp Generation
^^^^^^^^^^^^^^^^^^^^^^^^^^

Some bits are requests to the stack to try to generate timestamps. Any
combination of them is valid. Changes to these bits apply to newly
created packets, not to packets already in the stack. As a result, it
is possible to selectively request timestamps for a subset of packets
(e.g., for sampling) by bracketing a send() call with two setsockopt
calls, one to enable timestamp generation and one to disable it.
Timestamps may also be generated for reasons other than being
requested by a particular socket, such as when receive timestamping is
enabled system wide, as explained earlier.
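
The sampling pattern described above can be sketched as follows. This is
a minimal illustration, not a reference implementation: the helper names
set_tstamp_flags() and send_sampled() are hypothetical, and base_flags is
assumed to hold only the reporting and option bits the socket normally
runs with.

```c
#include <errno.h>
#include <linux/net_tstamp.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Set the SO_TIMESTAMPING flag bitmap on fd. Returns 0, or -1 with errno set. */
static int set_tstamp_flags(int fd, unsigned int flags)
{
    return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));
}

/*
 * Hypothetical helper: request a software tx timestamp for this one
 * send() only. base_flags is what the socket normally uses (reporting
 * and option bits, no generation bits).
 */
static ssize_t send_sampled(int fd, const void *buf, size_t len,
                            unsigned int base_flags)
{
    ssize_t ret;
    int saved_errno;

    /* First setsockopt: enable generation for packets created from now on. */
    if (set_tstamp_flags(fd, base_flags | SOF_TIMESTAMPING_TX_SOFTWARE))
        return -1;

    ret = send(fd, buf, len, 0);
    saved_errno = errno;

    /* Second setsockopt: disable generation again (best effort).
     * Packets already in the stack keep their timestamp request. */
    set_tstamp_flags(fd, base_flags);

    errno = saved_errno;
    return ret;
}
```

A caller would typically set base_flags to something like
SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID once, and then call
send_sampled() only for the packets it wants to measure.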

SOF_TIMESTAMPING_RX_HARDWARE:
  Request rx timestamps generated by the network adapter.

SOF_TIMESTAMPING_RX_SOFTWARE:
  Request rx timestamps when data enters the kernel. These timestamps
  are generated just after a device driver hands a packet to the
  kernel receive stack.

SOF_TIMESTAMPING_TX_HARDWARE:
  Request tx timestamps generated by the network adapter. This flag
  can be enabled via both socket options and control messages.

SOF_TIMESTAMPING_TX_SOFTWARE:
  Request tx timestamps when data leaves the kernel. These timestamps
  are generated in the device driver as close as possible, but always
  prior to, passing the packet to the network interface. Hence, they
  require driver support and may not be available for all devices.
  This flag can be enabled via both socket options and control messages.

SOF_TIMESTAMPING_TX_SCHED:
  Request tx timestamps prior to entering the packet scheduler. Kernel
  transmit latency is, if long, often dominated by queuing delay. The
  difference between this timestamp and one taken at
  SOF_TIMESTAMPING_TX_SOFTWARE will expose this latency independent
  of protocol processing. The latency incurred in protocol
  processing, if any, can be computed by subtracting a userspace
  timestamp taken immediately before send() from this timestamp.
  On machines with virtual devices where a transmitted packet travels
  through multiple devices and, hence, multiple packet schedulers,
  a timestamp is generated at each layer. This allows for fine
  grained measurement of queuing delay. This flag can be enabled
  via both socket options and control messages.

SOF_TIMESTAMPING_TX_ACK:
  Request tx timestamps when all data in the send buffer has been
  acknowledged. This only makes sense for reliable protocols. It is
  currently only implemented for TCP. For that protocol, it may
  over-report measurement, because the timestamp is generated when all
  data up to and including the buffer at send() was acknowledged: the
  cumulative acknowledgment. The mechanism ignores SACK and FACK.
  This flag can be enabled via both socket options and control messages.


1.3.2 Timestamp Reporting
^^^^^^^^^^^^^^^^^^^^^^^^^

The other three bits control which timestamps will be reported in a
generated control message. Changes to the bits take immediate
effect at the timestamp reporting locations in the stack. Timestamps
are only reported for packets that also have the relevant timestamp
generation request set.

SOF_TIMESTAMPING_SOFTWARE:
  Report any software timestamps when available.

SOF_TIMESTAMPING_SYS_HARDWARE:
  This option is deprecated and ignored.

SOF_TIMESTAMPING_RAW_HARDWARE:
  Report hardware timestamps as generated by
  SOF_TIMESTAMPING_TX_HARDWARE when available.


1.3.3 Timestamp Options
^^^^^^^^^^^^^^^^^^^^^^^

The interface supports the options

SOF_TIMESTAMPING_OPT_ID:
  Generate a unique identifier along with each packet. A process can
  have multiple concurrent timestamping requests outstanding. Packets
  can be reordered in the transmit path, for instance in the packet
  scheduler. In that case timestamps will be queued onto the error
  queue out of order from the original send() calls. It is then not
  always possible to uniquely match timestamps to the original send()
  calls based on timestamp order or payload inspection alone.

  This option associates each packet at send() with a unique
  identifier and returns that along with the timestamp. The identifier
  is derived from a per-socket u32 counter (that wraps). For datagram
  sockets, the counter increments with each sent packet. For stream
  sockets, it increments with every byte.

  The counter starts at zero. It is initialized the first time that
  the socket option is enabled. It is reset each time the option is
  enabled after having been disabled. Resetting the counter does not
  change the identifiers of existing packets in the system.

  This option is implemented only for transmit timestamps. There, the
  timestamp is always looped along with a struct sock_extended_err.
  The option modifies field ee_data to pass an id that is unique
  among all possibly concurrently outstanding timestamp requests for
  that socket.


SOF_TIMESTAMPING_OPT_CMSG:
  Support recv() cmsg for all timestamped packets. Control messages
  are already supported unconditionally on all packets with receive
  timestamps and on IPv6 packets with transmit timestamp. This option
  extends them to IPv4 packets with transmit timestamp. One use case
  is to correlate packets with their egress device, by enabling socket
  option IP_PKTINFO simultaneously.


SOF_TIMESTAMPING_OPT_TSONLY:
  Applies to transmit timestamps only. Makes the kernel return the
  timestamp as a cmsg alongside an empty packet, as opposed to
  alongside the original packet. This reduces the amount of memory
  charged to the socket's receive budget (SO_RCVBUF) and delivers
  the timestamp even if sysctl net.core.tstamp_allow_data is 0.
  This option disables SOF_TIMESTAMPING_OPT_CMSG.

SOF_TIMESTAMPING_OPT_STATS:
  Optional stats that are obtained along with the transmit timestamps.
  It must be used together with SOF_TIMESTAMPING_OPT_TSONLY.
  When the transmit timestamp is available, the stats are available in a
  separate control message of type SCM_TIMESTAMPING_OPT_STATS, as a
  list of TLVs (struct nlattr) of types. These stats allow the
  application to associate various transport layer stats with
  the transmit timestamps, such as how long a certain block of
  data was limited by the peer's receive window.

SOF_TIMESTAMPING_OPT_PKTINFO:
  Enable the SCM_TIMESTAMPING_PKTINFO control message for incoming
  packets with hardware timestamps. The message contains struct
  scm_ts_pktinfo, which supplies the index of the real interface which
  received the packet and its length at layer 2. A valid (non-zero)
  interface index will be returned only if CONFIG_NET_RX_BUSY_POLL is
  enabled and the driver is using NAPI. The struct also contains two
  other fields, but they are reserved and undefined.

SOF_TIMESTAMPING_OPT_TX_SWHW:
  Request both hardware and software timestamps for outgoing packets
  when SOF_TIMESTAMPING_TX_HARDWARE and SOF_TIMESTAMPING_TX_SOFTWARE
  are enabled at the same time. If both timestamps are generated,
  two separate messages will be looped to the socket's error queue,
  each containing just one timestamp.

New applications are encouraged to pass SOF_TIMESTAMPING_OPT_ID to
disambiguate timestamps and SOF_TIMESTAMPING_OPT_TSONLY to operate
regardless of the setting of sysctl net.core.tstamp_allow_data.

An exception is when a process needs additional cmsg data, for
instance SOL_IP/IP_PKTINFO to detect the egress network interface.
Then pass option SOF_TIMESTAMPING_OPT_CMSG. This option depends on
having access to the contents of the original packet, so cannot be
combined with SOF_TIMESTAMPING_OPT_TSONLY.


1.3.4. Enabling timestamps via control messages
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In addition to socket options, timestamp generation can be requested
per write via cmsg, only for SOF_TIMESTAMPING_TX_* (see Section 1.3.1).
Using this feature, applications can sample timestamps per sendmsg()
without paying the overhead of enabling and disabling timestamps via
setsockopt::

  struct msghdr *msg;
  struct cmsghdr *cmsg;
  ...
  cmsg = CMSG_FIRSTHDR(msg);
  cmsg->cmsg_level = SOL_SOCKET;
  cmsg->cmsg_type = SO_TIMESTAMPING;
  cmsg->cmsg_len = CMSG_LEN(sizeof(__u32));
  *((__u32 *) CMSG_DATA(cmsg)) = SOF_TIMESTAMPING_TX_SCHED |
                                 SOF_TIMESTAMPING_TX_SOFTWARE |
                                 SOF_TIMESTAMPING_TX_ACK;
  err = sendmsg(fd, msg, 0);

The SOF_TIMESTAMPING_TX_* flags set via cmsg will override
the SOF_TIMESTAMPING_TX_* flags set via setsockopt.

Moreover, applications must still enable timestamp reporting via
setsockopt to receive timestamps::

  __u32 val = SOF_TIMESTAMPING_SOFTWARE |
              SOF_TIMESTAMPING_OPT_ID /* or any other flag */;
  err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));


1.4 Bytestream Timestamps
-------------------------

The SO_TIMESTAMPING interface supports timestamping of bytes in a
bytestream. Each request is interpreted as a request for when the
entire contents of the buffer has passed a timestamping point. That
is, for streams option SOF_TIMESTAMPING_TX_SOFTWARE will record
when all bytes have reached the device driver, regardless of how
many packets the data has been converted into.

In general, bytestreams have no natural delimiters and therefore
correlating a timestamp with data is non-trivial. A range of bytes
may be split across segments, any segments may be merged (possibly
coalescing sections of previously segmented buffers associated with
independent send() calls). Segments can be reordered and the same
byte range can coexist in multiple segments for protocols that
implement retransmissions.

It is essential that all timestamps implement the same semantics,
regardless of these possible transformations, as otherwise they are
incomparable.
Handling "rare" corner cases differently from the
simple case (a 1:1 mapping from buffer to skb) is insufficient
because performance debugging often needs to focus on such outliers.

In practice, timestamps can be correlated with segments of a
bytestream consistently, if both semantics of the timestamp and the
timing of measurement are chosen correctly. This challenge is no
different from deciding on a strategy for IP fragmentation. There, the
definition is that only the first fragment is timestamped. For
bytestreams, we chose that a timestamp is generated only when all
bytes have passed a point. SOF_TIMESTAMPING_TX_ACK as defined is easy to
implement and reason about. An implementation that has to take into
account SACK would be more complex due to possible transmission holes
and out of order arrival.

On the host, TCP can also break the simple 1:1 mapping from buffer to
skbuff as a result of Nagle, cork, autocork, segmentation and GSO. The
implementation ensures correctness in all cases by tracking the
individual last byte passed to send(), even if it is no longer the
last byte after an skbuff extend or merge operation. It stores the
relevant sequence number in skb_shinfo(skb)->tskey. Because an skbuff
has only one such field, only one timestamp can be generated.

In rare cases, a timestamp request can be missed if two requests are
collapsed onto the same skb.
A process can detect this situation by
enabling SOF_TIMESTAMPING_OPT_ID and comparing the byte offset at
send time with the value returned for each timestamp. It can prevent
the situation by always flushing the TCP stack in between requests,
for instance by enabling TCP_NODELAY and disabling TCP_CORK and
autocork.

These precautions ensure that the timestamp is generated only when all
bytes have passed a timestamp point, assuming that the network stack
itself does not reorder the segments. The stack indeed tries to avoid
reordering. The one exception is under administrator control: it is
possible to construct a packet scheduler configuration that delays
segments from the same stream differently. Such a setup would be
unusual.


2 Data Interfaces
==================

Timestamps are read using the ancillary data feature of recvmsg().
See `man 3 cmsg` for details of this interface. The socket manual
page (`man 7 socket`) describes how timestamps generated with
SO_TIMESTAMP and SO_TIMESTAMPNS records can be retrieved.


2.1 SCM_TIMESTAMPING records
----------------------------

These timestamps are returned in a control message with cmsg_level
SOL_SOCKET, cmsg_type SCM_TIMESTAMPING, and payload of type

For SO_TIMESTAMPING_OLD::

	struct scm_timestamping {
		struct timespec ts[3];
	};

For SO_TIMESTAMPING_NEW::

	struct scm_timestamping64 {
		struct __kernel_timespec ts[3];
	};

Always use SO_TIMESTAMPING_NEW to get the timestamp in
struct scm_timestamping64 format.

SO_TIMESTAMPING_OLD returns incorrect timestamps after the year 2038
on 32 bit machines.

The structure can return up to three timestamps. This is a legacy
feature. At least one field is non-zero at any time. Most timestamps
are passed in ts[0]. Hardware timestamps are passed in ts[2].

ts[1] used to hold hardware timestamps converted to system time.
Instead, expose the hardware clock device on the NIC directly as
a HW PTP clock source, to allow time conversion in userspace and
optionally synchronize system time with a userspace PTP stack such
as linuxptp. For the PTP clock API, see Documentation/driver-api/ptp.rst.
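
The field layout above suggests a simple selection rule, sketched here.
The helper name pick_timestamp() is hypothetical, and preferring the
hardware timestamp over the software one is a design choice of this
sketch rather than something the interface mandates.

```c
#include <time.h>
#include <linux/errqueue.h>

/*
 * Pick one timestamp from an SCM_TIMESTAMPING payload: ts[2] (raw
 * hardware) if it is set, otherwise ts[0] (software). ts[1] is
 * deprecated and never consulted.
 * Returns the index used (2 or 0), or -1 if both fields are zero.
 */
static int pick_timestamp(const struct scm_timestamping *tss,
                          struct timespec *out)
{
    static const int order[] = { 2, 0 };
    unsigned int i;

    for (i = 0; i < sizeof(order) / sizeof(order[0]); i++) {
        const struct timespec *ts = &tss->ts[order[i]];

        if (ts->tv_sec != 0 || ts->tv_nsec != 0) {
            *out = *ts;
            return order[i];
        }
    }
    return -1;
}
```

Checking ts[2] first also sidesteps the case where ts[0] carries a
software timestamp that was filled in late, at recvmsg() time.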

Note that if the SO_TIMESTAMP or SO_TIMESTAMPNS option is enabled
together with SO_TIMESTAMPING using SOF_TIMESTAMPING_SOFTWARE, a false
software timestamp will be generated in the recvmsg() call and passed
in ts[0] when a real software timestamp is missing. This happens also
on hardware transmit timestamps.

2.1.1 Transmit timestamps with MSG_ERRQUEUE
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For transmit timestamps the outgoing packet is looped back to the
socket's error queue with the send timestamp(s) attached. A process
receives the timestamps by calling recvmsg() with flag MSG_ERRQUEUE
set and with a msg_control buffer sufficiently large to receive the
relevant metadata structures. The recvmsg call returns the original
outgoing data packet with two ancillary messages attached.

A message of cmsg_level SOL_IP(V6) and cmsg_type IP(V6)_RECVERR
embeds a struct sock_extended_err. This defines the error type. For
timestamps, the ee_errno field is ENOMSG. The other ancillary message
will have cmsg_level SOL_SOCKET and cmsg_type SCM_TIMESTAMPING. This
embeds the struct scm_timestamping.


2.1.1.2 Timestamp types
~~~~~~~~~~~~~~~~~~~~~~~

The semantics of the three struct timespec are defined by field
ee_info in the extended error structure.
It contains a value of
type SCM_TSTAMP_* to define the actual timestamp passed in
scm_timestamping.

The SCM_TSTAMP_* types are 1:1 matches to the SOF_TIMESTAMPING_*
control fields discussed previously, with one exception. For legacy
reasons, SCM_TSTAMP_SND is equal to zero and can be set for both
SOF_TIMESTAMPING_TX_HARDWARE and SOF_TIMESTAMPING_TX_SOFTWARE. It
is the first if ts[2] is non-zero, the second otherwise, in which
case the timestamp is stored in ts[0].


2.1.1.3 Fragmentation
~~~~~~~~~~~~~~~~~~~~~

Fragmentation of outgoing datagrams is rare, but is possible, e.g., by
explicitly disabling PMTU discovery. If an outgoing packet is fragmented,
then only the first fragment is timestamped and returned to the sending
socket.


2.1.1.4 Packet Payload
~~~~~~~~~~~~~~~~~~~~~~

The calling application is often not interested in receiving the whole
packet payload that it passed to the stack originally: the socket
error queue mechanism is just a method to piggyback the timestamp on.
In this case, the application can choose to read datagrams with a
smaller buffer, possibly even of length 0. The payload is truncated
accordingly. Until the process calls recvmsg() on the error queue,
however, the full packet is queued, taking up budget from SO_RCVBUF.


2.1.1.5 Blocking Read
~~~~~~~~~~~~~~~~~~~~~

Reading from the error queue is always a non-blocking operation. To
block waiting on a timestamp, use poll or select. poll() will return
POLLERR in pollfd.revents if any data is ready on the error queue.
There is no need to pass this flag in pollfd.events. This flag is
ignored on request. See also `man 2 poll`.


2.1.2 Receive timestamps
^^^^^^^^^^^^^^^^^^^^^^^^

On reception, there is no reason to read from the socket error queue.
The SCM_TIMESTAMPING ancillary data is sent along with the packet data
on a normal recvmsg(). Since this is not a socket error, it is not
accompanied by a message SOL_IP(V6)/IP(V6)_RECVERR. In this case,
the meaning of the three fields in struct scm_timestamping is
implicitly defined. ts[0] holds a software timestamp if set, ts[1]
is again deprecated and ts[2] holds a hardware timestamp if set.


3. Hardware Timestamping configuration: SIOCSHWTSTAMP and SIOCGHWTSTAMP
=======================================================================

Hardware time stamping must also be initialized for each device driver
that is expected to do hardware time stamping.
The parameter is defined in
include/uapi/linux/net_tstamp.h as::

	struct hwtstamp_config {
		int flags;	/* no flags defined right now, must be zero */
		int tx_type;	/* HWTSTAMP_TX_* */
		int rx_filter;	/* HWTSTAMP_FILTER_* */
	};

Desired behavior is passed into the kernel and to a specific device by
calling ioctl(SIOCSHWTSTAMP) with a pointer to a struct ifreq whose
ifr_data points to a struct hwtstamp_config. The tx_type and
rx_filter are hints to the driver about what it is expected to do. If
the requested fine-grained filtering for incoming packets is not
supported, the driver may time stamp more than just the requested types
of packets.

Drivers are free to use a more permissive configuration than the requested
configuration. It is expected that drivers should only implement directly the
most generic mode that can be supported. For example, if the hardware can
support HWTSTAMP_FILTER_V2_EVENT, then it should generally always upscale
HWTSTAMP_FILTER_V2_L2_SYNC_MESSAGE, and so forth, as HWTSTAMP_FILTER_V2_EVENT
is more generic (and more useful to applications).

A driver which supports hardware time stamping shall update the struct
with the actual, possibly more permissive configuration. If the
requested packets cannot be time stamped, then nothing should be
changed and ERANGE shall be returned (in contrast to EINVAL, which
indicates that SIOCSHWTSTAMP is not supported at all).

Only a process with admin rights may change the configuration. User
space is responsible for ensuring that multiple processes don't interfere
with each other and that the settings are reset.

Any process can read the actual configuration by passing this
structure to ioctl(SIOCGHWTSTAMP) in the same way. However, this has
not been implemented in all drivers.

::

    /* possible values for hwtstamp_config->tx_type */
    enum {
        /*
         * no outgoing packet will need hardware time stamping;
         * should a packet arrive which asks for it, no hardware
         * time stamping will be done
         */
        HWTSTAMP_TX_OFF,

        /*
         * enables hardware time stamping for outgoing packets;
         * the sender of the packet decides which are to be
         * time stamped by setting SOF_TIMESTAMPING_TX_SOFTWARE
         * before sending the packet
         */
        HWTSTAMP_TX_ON,
    };

    /* possible values for hwtstamp_config->rx_filter */
    enum {
        /* time stamp no incoming packet at all */
        HWTSTAMP_FILTER_NONE,

        /* time stamp any incoming packet */
        HWTSTAMP_FILTER_ALL,

        /* return value: time stamp all packets requested plus some others */
        HWTSTAMP_FILTER_SOME,

        /* PTP v1, UDP, any kind of event packet */
        HWTSTAMP_FILTER_PTP_V1_L4_EVENT,

        /* for the complete list of values, please check
         * the include file include/uapi/linux/net_tstamp.h
         */
    };

3.1 Hardware Timestamping Implementation: Device Drivers
--------------------------------------------------------

A driver which supports hardware time stamping must support the
SIOCSHWTSTAMP ioctl and update the supplied struct hwtstamp_config with
the actual values as described in the section on SIOCSHWTSTAMP. It
should also support SIOCGHWTSTAMP.

Time stamps for received packets must be stored in the skb. To get a pointer
to the shared time stamp structure of the skb call skb_hwtstamps(). Then
set the time stamps in the structure::

    struct skb_shared_hwtstamps {
        /* hardware time stamp transformed into duration
         * since arbitrary point in time
         */
        ktime_t hwtstamp;
    };

Time stamps for outgoing packets are to be generated as follows:

- In hard_start_xmit(), check if (skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP)
  is non-zero. If yes, then the driver is expected to do hardware time
  stamping.
- If this is possible for the skb and requested, then declare
  that the driver is doing the time stamping by setting the flag
  SKBTX_IN_PROGRESS in skb_shinfo(skb)->tx_flags, e.g.
  with::

    skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;

  You might want to keep a pointer to the associated skb for the next step
  and not free the skb. A driver not supporting hardware time stamping doesn't
  do that. A driver must never touch sk_buff::tstamp! It is used to store
  software generated time stamps by the network subsystem.
- The driver should call skb_tx_timestamp() as close to passing sk_buff to
  hardware as possible. skb_tx_timestamp() provides a software time stamp if
  requested and hardware timestamping is not possible (SKBTX_IN_PROGRESS not
  set).
- As soon as the driver has sent the packet and/or obtained a
  hardware time stamp for it, it passes the time stamp back by
  calling skb_tstamp_tx() with the original skb and the raw
  hardware time stamp. skb_tstamp_tx() clones the original skb and
  adds the timestamps, therefore the original skb has to be freed now.
  If obtaining the hardware time stamp somehow fails, then the driver
  should not fall back to software time stamping. The rationale is that
  this would occur at a later time in the processing pipeline than other
  software time stamping and therefore could lead to unexpected deltas
  between time stamps.

3.2 Special considerations for stacked PTP Hardware Clocks
----------------------------------------------------------

There are situations when there may be more than one PHC (PTP Hardware Clock)
in the data path of a packet.
The kernel has no explicit mechanism to allow the
user to select which PHC to use for timestamping Ethernet frames. Instead, the
assumption is that the outermost PHC is always the most preferable, and that
kernel drivers collaborate towards achieving that goal. Currently there are 3
cases of stacked PHCs, detailed below:

3.2.1 DSA (Distributed Switch Architecture) switches
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

These are Ethernet switches which have one of their ports connected to an
(otherwise completely unaware) host Ethernet interface, and perform the role of
a port multiplier with optional forwarding acceleration features. Each DSA
switch port is visible to the user as a standalone (virtual) network interface,
and its network I/O is performed, under the hood, indirectly through the host
interface (redirecting to the host port on TX, and intercepting frames on RX).

When a DSA switch is attached to a host port, PTP synchronization has to
suffer, since the switch's variable queuing delay introduces a path delay
jitter between the host port and its PTP partner. For this reason, some DSA
switches include a timestamping clock of their own, and have the ability to
perform network timestamping on their own MAC, such that path delays only
measure wire and PHY propagation latencies.
Timestamping DSA switches are
supported in Linux and expose the same ABI as any other network interface
(although DSA interfaces are virtual in terms of network I/O, they do have
their own PHC). It is typical, but not mandatory, for all
interfaces of a DSA switch to share the same PHC.

By design, PTP timestamping with a DSA switch does not need any special
handling in the driver for the host port it is attached to. However, when the
host port also supports PTP timestamping, DSA will take care of intercepting
the ``.ndo_do_ioctl`` calls towards the host port, and block attempts to enable
hardware timestamping on it. This is because the SO_TIMESTAMPING API does not
allow the delivery of multiple hardware timestamps for the same packet, so
anyone other than the DSA switch port must be prevented from doing so.

DSA already provides most of the timestamping infrastructure in generic
code: a BPF classifier (``ptp_classify_raw``) is used to identify
PTP event messages (any other packets, including PTP general messages, are not
timestamped), and two hooks are provided to drivers:

- ``.port_txtstamp()``: The driver is passed a clone of the timestampable skb
  to be transmitted, before actually transmitting it. Typically, a switch will
  have a PTP TX timestamp register (or sometimes a FIFO) where the timestamp
  becomes available.
  There may be an IRQ that is raised upon this timestamp's
  availability, or the driver might have to poll after invoking
  ``dev_queue_xmit()`` towards the host interface. Either way, in the
  ``.port_txtstamp()`` method, the driver only needs to save the clone for
  later use (when the timestamp becomes available). Each skb is annotated with
  a pointer to its clone, in ``DSA_SKB_CB(skb)->clone``, to ease the driver's
  job of keeping track of which clone belongs to which skb.

- ``.port_rxtstamp()``: The original (and only) timestampable skb is provided
  to the driver, for it to annotate it with a timestamp, if that is immediately
  available, or defer to later. On reception, timestamps might either be
  available in-band (through metadata in the DSA header, or attached in other
  ways to the packet), or out-of-band (through another RX timestamping FIFO).
  Deferral on RX is typically necessary when retrieving the timestamp needs a
  sleepable context. In that case, it is the responsibility of the DSA driver
  to call ``netif_rx_ni()`` on the freshly timestamped skb.

3.2.2 Ethernet PHYs
^^^^^^^^^^^^^^^^^^^

These are devices that typically fulfill a Layer 1 role in the network stack,
hence they do not have a representation in terms of a network interface as DSA
switches do. However, PHYs may be able to detect and timestamp PTP packets, for
performance reasons: timestamps taken as close as possible to the wire have the
potential to yield a more stable and precise synchronization.

A PHY driver that supports PTP timestamping must create a ``struct
mii_timestamper`` and add a pointer to it in ``phydev->mii_ts``. The presence
of this pointer will be checked by the networking stack.

Since PHYs do not have network interface representations, the timestamping and
ethtool ioctl operations for them need to be mediated by their respective MAC
driver. Therefore, as opposed to DSA switches, modifications need to be done
to each individual MAC driver for PHY timestamping support. This entails:

- Checking, in ``.ndo_do_ioctl``, whether ``phy_has_hwtstamp(netdev->phydev)``
  is true or not. If it is, then the MAC driver should not process this request
  but instead pass it on to the PHY using ``phy_mii_ioctl()``.

- On RX, special intervention may or may not be needed, depending on the
  function used to deliver skbs up the network stack. In the case of plain
  ``netif_rx()`` and similar, MAC drivers must check whether
  ``skb_defer_rx_timestamp(skb)`` is necessary or not - and if it is, don't
  call ``netif_rx()`` at all. If ``CONFIG_NETWORK_PHY_TIMESTAMPING`` is
  enabled, and ``skb->dev->phydev->mii_ts`` exists, its ``.rxtstamp()`` hook
  will be called now, to determine, using logic very similar to DSA, whether
  deferral for RX timestamping is necessary. Again like DSA, it becomes the
  responsibility of the PHY driver to send the packet up the stack when the
  timestamp is available.

  For other skb receive functions, such as ``napi_gro_receive`` and
  ``netif_receive_skb``, the stack automatically checks whether
  ``skb_defer_rx_timestamp()`` is necessary, so this check is not needed inside
  the driver.

- On TX, again, special intervention might or might not be needed. The
  function that calls the ``mii_ts->txtstamp()`` hook is named
  ``skb_clone_tx_timestamp()``. This function can be called directly
  (in which case explicit MAC driver support is indeed needed), but it
  also piggybacks on the ``skb_tx_timestamp()`` call, which many MAC
  drivers already perform for software timestamping purposes. Therefore, if a
  MAC supports software timestamping, it does not need to do anything further
  at this stage.

3.2.3 MII bus snooping devices
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

These perform the same role as timestamping Ethernet PHYs, save for the fact
that they are discrete devices and can therefore be used in conjunction with
any PHY even if it doesn't support timestamping. In Linux, they are
discoverable and attachable to a ``struct phy_device`` through Device Tree, and
otherwise use the same mii_ts infrastructure as timestamping PHYs. See
Documentation/devicetree/bindings/ptp/timestamper.txt for more details.
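
Because timestamping PHYs and MII bus snooping devices share the mii_ts
infrastructure, the MAC-side ioctl delegation described in section 3.2.2
applies to both. The following is only a userspace model of that decision, not
kernel code: every ``mock_`` type, enum and function is a hypothetical
stand-in, and ``mock_phy_has_hwtstamp()`` merely approximates
``phy_has_hwtstamp()`` by checking that a timestamper is attached.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the ioctl request codes. */
enum mock_cmd { MOCK_SIOCSHWTSTAMP, MOCK_SIOCGHWTSTAMP, MOCK_OTHER };

/* Who ends up servicing the request in this model. */
enum mock_handler { HANDLED_BY_MAC, HANDLED_BY_TIMESTAMPER };

/* Reduced models of struct mii_timestamper, phy_device and net_device. */
struct mock_mii_timestamper { int unused; };
struct mock_phy_device { struct mock_mii_timestamper *mii_ts; };
struct mock_net_device { struct mock_phy_device *phydev; };

/* Approximates phy_has_hwtstamp(): true when a timestamper is attached. */
static bool mock_phy_has_hwtstamp(const struct mock_phy_device *phydev)
{
    return phydev != NULL && phydev->mii_ts != NULL;
}

/*
 * Models the check a MAC driver performs in its .ndo_do_ioctl handler:
 * hardware timestamping requests are passed on (a real driver would call
 * phy_mii_ioctl() here) when the PHY or snooping device can timestamp.
 */
static enum mock_handler mock_mac_ioctl(const struct mock_net_device *ndev,
                                        enum mock_cmd cmd)
{
    bool is_hwtstamp = (cmd == MOCK_SIOCSHWTSTAMP ||
                        cmd == MOCK_SIOCGHWTSTAMP);

    assert(ndev != NULL);
    if (is_hwtstamp && mock_phy_has_hwtstamp(ndev->phydev))
        return HANDLED_BY_TIMESTAMPER;

    return HANDLED_BY_MAC;
}
```

In this model, a MAC whose ``phydev`` has no ``mii_ts`` keeps handling the
timestamping ioctls itself, and unrelated requests are never delegated.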

3.2.4 Other caveats for MAC drivers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A caveat of stacked PHCs, especially DSA (but not only), is that they uncover
bugs which were impossible to trigger before the existence of stacked PTP
clocks. Because DSA doesn't require any modification to MAC drivers, it is
more difficult to ensure correctness of all possible code paths. One example
has to do with this line of code, already presented earlier::

    skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;

Any TX timestamping logic, be it a plain MAC driver, a DSA switch driver, a PHY
driver or an MII bus snooping device driver, should set this flag.
But a MAC driver that is unaware of PHC stacking might get tripped up by
somebody other than itself setting this flag, and deliver a duplicate
timestamp.
For example, a typical driver design for TX timestamping might be to split the
transmission part into 2 portions:

1. "TX": checks whether PTP timestamping has been previously enabled through
   the ``.ndo_do_ioctl`` ("``priv->hwtstamp_tx_enabled == true``") and the
   current skb requires a TX timestamp ("``skb_shinfo(skb)->tx_flags &
   SKBTX_HW_TSTAMP``"). If this is true, it sets the
   "``skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS``" flag. Note: as
   described above, in the case of a stacked PHC system, this condition should
   never trigger, as this MAC is certainly not the outermost PHC.
   But this is
   not where the typical issue is. Transmission proceeds with this packet.

2. "TX confirmation": Transmission has finished. The driver checks whether it
   is necessary to collect any TX timestamp for it. Here is where the typical
   issues are: the MAC driver takes a shortcut and only checks whether
   "``skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS``" was set. With a stacked
   PHC system, this is incorrect because this MAC driver is not the only entity
   in the TX data path that could have set SKBTX_IN_PROGRESS in the first
   place.

The correct solution for this problem is for MAC drivers to have a compound
check in their "TX confirmation" portion, not only for
"``skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS``", but also for
"``priv->hwtstamp_tx_enabled == true``". Because the rest of the system ensures
that PTP timestamping is not enabled for anything other than the outermost PHC,
this enhanced check will avoid delivering a duplicated TX timestamp to user
space.
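
The difference between the shortcut and the compound check can be modelled in
plain userspace C. This is a minimal sketch under stated assumptions, not
driver code: ``struct mock_skb``, ``struct mock_mac_priv`` and the ``mock_``
functions are hypothetical, and the flag values are stand-ins for the real
definitions in the kernel headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in flag values; the real ones are defined by the kernel. */
#define SKBTX_HW_TSTAMP   (1u << 0)
#define SKBTX_IN_PROGRESS (1u << 2)

/* Hypothetical model of skb_shinfo(skb)->tx_flags. */
struct mock_skb {
    uint8_t tx_flags;
};

/* Hypothetical MAC private data; set when TX stamping was enabled by ioctl. */
struct mock_mac_priv {
    bool hwtstamp_tx_enabled;
};

/* "TX" portion: claim the timestamp only if this MAC was configured for it. */
static void mock_mac_xmit(const struct mock_mac_priv *priv,
                          struct mock_skb *skb)
{
    assert(priv != NULL && skb != NULL);
    if (priv->hwtstamp_tx_enabled && (skb->tx_flags & SKBTX_HW_TSTAMP))
        skb->tx_flags |= SKBTX_IN_PROGRESS;
}

/* "TX confirmation" shortcut: trusts the flag alone (wrong when stacked). */
static bool mock_txconf_flag_only(const struct mock_mac_priv *priv,
                                  const struct mock_skb *skb)
{
    (void)priv;
    return (skb->tx_flags & SKBTX_IN_PROGRESS) != 0;
}

/* "TX confirmation" compound check: flag set AND this MAC was enabled. */
static bool mock_txconf_compound(const struct mock_mac_priv *priv,
                                 const struct mock_skb *skb)
{
    return priv->hwtstamp_tx_enabled &&
           (skb->tx_flags & SKBTX_IN_PROGRESS) != 0;
}
```

If an outer PHC (for example a DSA switch port) has already set
SKBTX_IN_PROGRESS on an skb and the MAC itself has timestamping disabled, the
flag-only variable reports that a timestamp should be collected, while the
compound check does not.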