.. SPDX-License-Identifier: GPL-2.0

=====================================
Network Devices, the Kernel, and You!
=====================================


Introduction
============
The following is a random collection of documentation regarding
network devices.

struct net_device lifetime rules
================================
Network device structures need to persist even after the module is unloaded and
must be allocated with alloc_netdev_mqs() and friends.
If the device has registered successfully, it will be freed on last use
by free_netdev(). This is required to handle the pathological case cleanly
(example: ``rmmod mydriver </sys/class/net/myeth/mtu``).

alloc_netdev_mqs() / alloc_netdev() reserve extra space for driver
private data which gets freed when the network device is freed. If
separately allocated data is attached to the network device
(netdev_priv()) then it is up to the module exit handler to free that.
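
For illustration, a minimal sketch of such an exit handler (the
``my_device_priv`` layout and its ``rx_ring`` pointer are hypothetical; the
full probe/remove pattern is shown below):

.. code-block:: c

    struct my_device_priv {
        struct my_rx_ring *rx_ring;    /* separately allocated */
    };

    void remove()
    {
        struct my_device_priv *priv = netdev_priv(dev);

        unregister_netdev(dev);
        /* The netdev_priv() area itself is freed together with the
         * netdev by free_netdev(), but anything it merely points to
         * must be freed by the driver.
         */
        kfree(priv->rx_ring);
        free_netdev(dev);
    }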

There are two groups of APIs for registering struct net_device.
The first group can be used in normal contexts where ``rtnl_lock`` is not
already held: register_netdev(), unregister_netdev().
The second group can be used when ``rtnl_lock`` is already held:
register_netdevice(), unregister_netdevice(), free_netdev().

Simple drivers
--------------

Most drivers (especially device drivers) handle the lifetime of struct
net_device in a context where ``rtnl_lock`` is not held (e.g. driver probe
and remove paths).

In that case the struct net_device registration is done using
the register_netdev() and unregister_netdev() functions:

.. code-block:: c

    int probe()
    {
        struct my_device_priv *priv;
        int err;

        dev = alloc_netdev_mqs(...);
        if (!dev)
            return -ENOMEM;
        priv = netdev_priv(dev);

        /* ... do all device setup before calling register_netdev() ... */

        err = register_netdev(dev);
        if (err)
            goto err_undo;

        /* net_device is visible to the user! */
        return 0;

    err_undo:
        /* ... undo the device setup ... */
        free_netdev(dev);
        return err;
    }

    void remove()
    {
        unregister_netdev(dev);
        free_netdev(dev);
    }

Note that after calling register_netdev() the device is visible in the system.
Users can open it and start sending / receiving traffic immediately,
or run any other callback, so all initialization must be done prior to
registration.

unregister_netdev() closes the device and waits for all users to be done
with it. The memory of struct net_device itself may still be referenced
by sysfs but all operations on that device will fail.

free_netdev() can be called after unregister_netdev() returns or when
register_netdev() failed.

Device management under RTNL
----------------------------

Registering struct net_device while in a context which already holds
the ``rtnl_lock`` requires extra care. In those scenarios most drivers
will want to make use of struct net_device's ``needs_free_netdev``
and ``priv_destructor`` members for freeing state.

Example flow of netdev handling under ``rtnl_lock``:

.. code-block:: c

    static void my_setup(struct net_device *dev)
    {
        dev->needs_free_netdev = true;
    }

    static void my_destructor(struct net_device *dev)
    {
        struct my_device_priv *priv = netdev_priv(dev);

        some_obj_destroy(priv->obj);
        some_uninit(priv);
    }

    int create_link()
    {
        struct my_device_priv *priv;
        int err;

        ASSERT_RTNL();

        dev = alloc_netdev(sizeof(*priv), "net%d", NET_NAME_UNKNOWN, my_setup);
        if (!dev)
            return -ENOMEM;
        priv = netdev_priv(dev);

        /* Implicit constructor */
        err = some_init(priv);
        if (err)
            goto err_free_dev;

        priv->obj = some_obj_create();
        if (!priv->obj) {
            err = -ENOMEM;
            goto err_some_uninit;
        }
        /* End of constructor, set the destructor: */
        dev->priv_destructor = my_destructor;

        err = register_netdevice(dev);
        if (err)
            /* register_netdevice() calls destructor on failure */
            goto err_free_dev;

        /* If anything fails now unregister_netdevice() (or unregister_netdev())
         * will take care of calling my_destructor and free_netdev().
         */

        return 0;

    err_some_uninit:
        some_uninit(priv);
    err_free_dev:
        free_netdev(dev);
        return err;
    }

If struct net_device.priv_destructor is set, it will be called by the core
some time after unregister_netdevice(); it will also be called if
register_netdevice() fails. The callback may be invoked with or without
``rtnl_lock`` held.

There is no explicit constructor callback; the driver "constructs" the private
netdev state after allocating it and before registration.

Setting struct net_device.needs_free_netdev makes the core call free_netdev()
automatically after unregister_netdevice() when all references to the device
are gone. It only takes effect after a successful call to register_netdevice(),
so if register_netdevice() fails the driver is responsible for calling
free_netdev().

free_netdev() is safe to call on error paths right after unregister_netdevice()
or when register_netdevice() fails. Parts of the netdev (de)registration process
happen after ``rtnl_lock`` is released; therefore in those cases free_netdev()
will defer some of the processing until ``rtnl_lock`` is released.

Devices spawned from struct rtnl_link_ops should never free the
struct net_device directly.
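
For rtnl_link_ops based devices the teardown therefore typically just queues
the netdev for unregistration and lets the core (via ``needs_free_netdev``
and ``priv_destructor``) do the freeing. A minimal sketch, reusing the
hypothetical my_* names from the example above (the "mydev" kind is likewise
made up):

.. code-block:: c

    static void my_dellink(struct net_device *dev, struct list_head *head)
    {
        /* Queue the device for unregistration; the core will invoke
         * my_destructor and free_netdev() once all references are gone.
         * Never call free_netdev() here directly.
         */
        unregister_netdevice_queue(dev, head);
    }

    static struct rtnl_link_ops my_link_ops __read_mostly = {
        .kind      = "mydev",
        .priv_size = sizeof(struct my_device_priv),
        .setup     = my_setup,
        .dellink   = my_dellink,
    };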

.ndo_init and .ndo_uninit
~~~~~~~~~~~~~~~~~~~~~~~~~

``.ndo_init`` and ``.ndo_uninit`` callbacks are called during net_device
registration and de-registration, under ``rtnl_lock``. Drivers can use
those e.g. when parts of their init process need to run under ``rtnl_lock``.

``.ndo_init`` runs before the device is visible in the system. ``.ndo_uninit``
runs during de-registration, after the device has been closed, but other
subsystems may still have outstanding references to the netdevice.
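
A minimal sketch of wiring these callbacks up (the my_* helpers and the
other ops are hypothetical):

.. code-block:: c

    static int my_ndo_init(struct net_device *dev)
    {
        /* Runs under rtnl_lock, before the device is visible. */
        return my_alloc_resources_under_rtnl(dev);
    }

    static void my_ndo_uninit(struct net_device *dev)
    {
        /* Runs under rtnl_lock during unregistration; other subsystems
         * may still hold references to the netdev at this point.
         */
        my_release_resources_under_rtnl(dev);
    }

    static const struct net_device_ops my_netdev_ops = {
        .ndo_init   = my_ndo_init,
        .ndo_uninit = my_ndo_uninit,
        .ndo_open   = my_open,
        .ndo_stop   = my_stop,
    };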

MTU
===
Each network device has a Maximum Transmission Unit (MTU). The MTU does not
include any link layer protocol overhead. Upper layer protocols must
not pass a socket buffer (skb) to a device to transmit with more data
than the MTU. Since the MTU does not include the link layer header, on
Ethernet with the standard MTU of 1500 bytes the actual skb will contain
up to 1514 bytes because of the Ethernet header. Devices should allow
for the 4 byte VLAN header as well.

Segmentation Offload (GSO, TSO) is an exception to this rule. The
upper layer protocol may pass a large socket buffer to the device
transmit routine, and the device will break that up into separate
packets based on the current MTU.

MTU is symmetrical and applies both to receive and transmit. A device
must be able to receive at least the maximum size packet allowed by
the MTU. A network device may use the MTU as a mechanism to size receive
buffers, but the device should allow packets with VLAN header. With the
standard Ethernet MTU of 1500 bytes, the device should allow up to
1518 byte packets (1500 + 14 header + 4 tag). The device may drop,
truncate, or pass up oversize packets, but dropping oversize packets is
preferred.
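
For example, a driver sizing its receive buffers from the MTU might compute
something like the following (a sketch; my_rx_buf_len is made up, while
ETH_HLEN and VLAN_HLEN are the standard 14 and 4 byte header lengths):

.. code-block:: c

    /* 1500 byte MTU -> room for 1518 byte frames (1500 + 14 + 4). */
    static unsigned int my_rx_buf_len(const struct net_device *dev)
    {
        return dev->mtu + ETH_HLEN + VLAN_HLEN;
    }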

struct net_device synchronization rules
=======================================
ndo_open:
    Synchronization: rtnl_lock() semaphore.
    Context: process

ndo_stop:
    Synchronization: rtnl_lock() semaphore.
    Context: process
    Note: netif_running() is guaranteed false

ndo_do_ioctl:
    Synchronization: rtnl_lock() semaphore.
    Context: process

ndo_get_stats:
    Synchronization: dev_base_lock rwlock.
    Context: nominally process, but don't sleep inside an rwlock

ndo_start_xmit:
    Synchronization: __netif_tx_lock spinlock.

    When the driver sets NETIF_F_LLTX in dev->features this will be
    called without holding netif_tx_lock. In this case the driver
    has to lock by itself when needed.
    The locking there should also properly protect against
    set_rx_mode. WARNING: use of NETIF_F_LLTX is deprecated.
    Don't use it for new drivers.

    Context: Process with BHs disabled or BH (timer),
    will be called with interrupts disabled by netconsole.

    Return codes:

    * NETDEV_TX_OK everything ok.
    * NETDEV_TX_BUSY Cannot transmit packet, try later
      Usually a bug, means queue start/stop flow control is broken in
      the driver. Note: the driver must NOT put the skb in its DMA ring.

ndo_tx_timeout:
    Synchronization: netif_tx_lock spinlock; all TX queues frozen.
    Context: BHs disabled
    Notes: netif_queue_stopped() is guaranteed true

ndo_set_rx_mode:
    Synchronization: netif_addr_lock spinlock.
    Context: BHs disabled

struct napi_struct synchronization rules
========================================
napi->poll:
    Synchronization:
        NAPI_STATE_SCHED bit in napi->state. Device
        driver's ndo_stop method will invoke napi_disable() on
        all NAPI instances which will do a sleeping poll on the
        NAPI_STATE_SCHED napi->state bit, waiting for all pending
        NAPI activity to cease.

    Context:
        softirq
        will be called with interrupts disabled by netconsole.
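
As an illustration of the contract above, a minimal poll handler sketch
(the my_* names are hypothetical, and the private struct is assumed to
embed its struct napi_struct as a ``napi`` member):

.. code-block:: c

    static int my_poll(struct napi_struct *napi, int budget)
    {
        struct my_device_priv *priv = container_of(napi,
                                                   struct my_device_priv,
                                                   napi);
        int work_done;

        /* Process at most @budget received packets. */
        work_done = my_clean_rx_ring(priv, budget);

        if (work_done < budget && napi_complete_done(napi, work_done)) {
            /* All work done and NAPI_STATE_SCHED cleared: it is now
             * safe to re-enable device interrupts.
             */
            my_enable_rx_irq(priv);
        }

        return work_done;
    }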