Thursday, 21 May 2020

How a network device driver works

High level overview of the path of a packet:
  1. Driver is loaded and initialized.
  2. Packet arrives at the NIC from the network.
  3. Packet is copied (via DMA) to a ring buffer in kernel memory.
  4. Hardware interrupt is generated to let the system know a packet is in memory.
  5. Driver calls into NAPI to start a poll loop if one was not running already.
  6. ksoftirqd processes run on each CPU on the system. They are registered at boot time. The ksoftirqd processes pull packets off the ring buffer by calling the NAPI poll function that the device driver registered during initialization.
  7. Memory regions in the ring buffer that have had network data written to them are unmapped.
  8. Data that was DMA’d into memory is passed up the networking layer as an ‘skb’ for more processing.
  9. Packet steering happens to distribute packet processing load to multiple CPUs (in lieu of a NIC with multiple receive queues), if enabled.
  10. Packets are handed to the protocol layers from the queues.
  11. Protocol layers add them to receive buffers attached to sockets.

Network Device Driver:


A driver registers an initialization function which is called by the kernel when the driver is loaded. This function is registered by using the module_init macro.

The e1000e initialization function (e1000_init_module) and its registration with module_init can be found in drivers/net/ethernet/intel/e1000e/netdev.c:

/**
 *  e1000_init_module - Driver Registration Routine
 *
 *  e1000_init_module is the first routine called when the driver is
 *  loaded. All it does is register with the PCI subsystem.
 **/
static int __init e1000_init_module(void)
{
  int ret;

  pr_info("%s - version %s\n", e1000e_driver_string, e1000e_driver_version);
  pr_info("%s\n", e1000e_copyright);

  /* ... */

  ret = pci_register_driver(&e1000_driver);
  return ret;
}
module_init(e1000_init_module);


PCI initialization:

The Intel network card is a PCI express device.
PCI devices identify themselves with a series of registers in the PCI Configuration Space.
When a device driver is compiled, a macro named MODULE_DEVICE_TABLE (from include/linux/module.h) is used to export a table of PCI device IDs identifying devices that the device driver can control. The table is also registered as part of a structure, as we’ll see shortly.
The kernel uses this table to determine which device driver to load to control the device.
That’s how the OS can figure out which devices are connected to the system and which driver should be used to talk to the device.
This table and the PCI device IDs for the e1000e driver can be found in drivers/net/ethernet/intel/e1000e/netdev.c and drivers/net/ethernet/intel/e1000e/e1000.h, respectively:

static DEFINE_PCI_DEVICE_TABLE(e1000_pci_tbl) = {
  { PCI_VDEVICE(INTEL, E1000_DEV_ID_82571EB_COPPER), board_82571 },
  { PCI_VDEVICE(INTEL, E1000_DEV_ID_82571EB_FIBER), board_82571 },
  /* ... */
};
MODULE_DEVICE_TABLE(pci, e1000_pci_tbl);

As seen in the previous section, pci_register_driver is called in the driver’s initialization function.
This function registers a structure of pointers. Most of the pointers are function pointers, but the PCI device ID table is also registered. The kernel uses the functions registered by the driver to bring the PCI device up:

static struct pci_driver e1000_driver = {
  .name     = e1000e_driver_name,
  .id_table = e1000_pci_tbl,
  .probe    = e1000_probe,
  .remove   = e1000_remove,

  /* ... */
};

PCI probe

Once a device has been identified by its PCI IDs, the kernel can then select the proper driver to use to control the device. Each PCI driver registers a probe function with the PCI system in the kernel. The kernel calls this function for devices which have not yet been claimed by a device driver. Once a device is claimed, other drivers will not be asked about the device. Most drivers have a lot of code that runs to get the device ready for use. The exact things done vary from driver to driver.
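To make the matching and claiming steps concrete, here is a minimal userspace sketch. It is not kernel code: every sketch_* name, the simplified structs, and the device ID values are assumptions made up for illustration. It only shows the idea that a driver is offered a device when its ID table lists that device, and that the first successful probe claims it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel's ID table and pci_driver. */
struct sketch_pci_id { uint16_t vendor, device; };

struct sketch_pci_driver {
    const char *name;
    const struct sketch_pci_id *id_table;           /* terminated by {0, 0} */
    int (*probe)(uint16_t vendor, uint16_t device); /* 0 means "claimed" */
};

/* Return 1 if the ID table lists this vendor/device pair. */
static int sketch_id_match(const struct sketch_pci_id *tbl,
                           uint16_t vendor, uint16_t device)
{
    for (; tbl->vendor != 0; tbl++)
        if (tbl->vendor == vendor && tbl->device == device)
            return 1;
    return 0;
}

/* Offer an unclaimed device to each driver; the first successful probe wins. */
static const struct sketch_pci_driver *
sketch_bind(const struct sketch_pci_driver *const *drivers, size_t n,
            uint16_t vendor, uint16_t device)
{
    for (size_t i = 0; i < n; i++) {
        const struct sketch_pci_driver *d = drivers[i];

        if (sketch_id_match(d->id_table, vendor, device) &&
            d->probe(vendor, device) == 0)
            return d;   /* claimed: later drivers are not asked */
    }
    return NULL;        /* no driver claimed the device */
}

/* Example driver. 0x8086 is Intel's vendor ID; the device IDs are made up. */
static int sketch_probe_ok(uint16_t vendor, uint16_t device)
{
    (void)vendor;
    (void)device;
    return 0;
}

static const struct sketch_pci_id sketch_nic_ids[] = {
    { 0x8086, 0x0001 }, { 0x8086, 0x0002 }, { 0, 0 }
};

static const struct sketch_pci_driver sketch_nic_driver = {
    "sketch_nic", sketch_nic_ids, sketch_probe_ok
};
```

Calling sketch_bind with a one-element driver array and the pair (0x8086, 0x0001) returns the example driver, while an unlisted device ID returns NULL.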
Some typical operations to perform include:
  1. Enabling the PCI device.
  2. Requesting memory ranges and IO ports.
  3. Setting the DMA mask.
  4. Registering the ethtool (described more below) functions the driver supports.
  5. Starting any watchdog tasks needed (for example, e1000e has a watchdog task to check if the hardware is hung).
  6. Handling other device-specific details, such as workarounds for hardware quirks.
  7. Creating, initializing, and registering a struct net_device_ops structure. This structure contains function pointers to the various functions needed for opening the device, sending data to the network, setting the MAC address, and more.
  8. Creating, initializing, and registering a high level struct net_device which represents a network device.
Let’s take a quick look at some of these operations in the e1000e driver, in the function e1000_probe.

A peek into PCI initialization

The following code from the e1000_probe function does some basic PCI configuration. From drivers/net/ethernet/intel/e1000e/netdev.c:

err = pci_enable_device_mem(pdev);

/* ... */

err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));

/* ... */

bars = pci_select_bars(pdev, IORESOURCE_MEM);
err = pci_request_selected_regions_exclusive(pdev, bars, e1000e_driver_name);


First, the device is initialized with pci_enable_device_mem. This will wake up the device if it is suspended, enable memory resources, and more.
Next, the DMA mask will be set. This device can read and write to 64bit memory addresses, so dma_set_mask_and_coherent is called with DMA_BIT_MASK(64).
Memory regions will be reserved with a call to pci_request_selected_regions_exclusive. PCI Express Advanced Error Reporting is enabled (if the PCI AER driver is loaded), DMA is enabled with a call to pci_set_master, and the PCI configuration space is saved with a call to pci_save_state.

Network device initialization

The e1000_probe function does some important network device initialization. In addition to the PCI specific work, it will do more general networking and network device work:
  1. An Ethernet device is allocated with alloc_etherdev.
  2. The struct net_device_ops is registered.
  3. ethtool operations are registered.
  4. The default MAC address is obtained from the NIC.
  5. net_device feature flags are set.
  6. And lots more.

struct net_device_ops:

The struct net_device_ops contains function pointers to lots of important operations that the network subsystem needs to control the device. We’ll be mentioning this structure many times throughout the rest of this post.

This net_device_ops structure is attached to a struct net_device in e1000_probe (from drivers/net/ethernet/intel/e1000e/netdev.c):

static int e1000_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
{
  /* ... */

  netdev->netdev_ops = &e1000e_netdev_ops;

  /* ... */
}

And the functions that this net_device_ops structure holds pointers to are set in the same file. From drivers/net/ethernet/intel/e1000e/netdev.c:

static const struct net_device_ops e1000e_netdev_ops = {
  .ndo_open               = e1000_open,
  .ndo_stop               = e1000_close,
  .ndo_start_xmit         = e1000_xmit_frame,
  .ndo_get_stats64        = e1000_get_stats64,
  .ndo_set_rx_mode        = e1000_set_rx_mode,
  .ndo_set_mac_address    = e1000_set_mac,
  .ndo_change_mtu         = e1000_change_mtu,
  .ndo_do_ioctl           = e1000_ioctl,

  /* ... */
};

As you can see, there are several interesting fields in this struct like ndo_open, ndo_stop, ndo_start_xmit, and ndo_get_stats64 which hold the addresses of functions implemented by the e1000e driver.

ethtool registrations:

ethtool is a command line program you can use to get and set various driver and hardware options. You can install it on Ubuntu by running apt-get install ethtool.
A common use of ethtool is to gather detailed statistics from network devices. Other ethtool settings of interest will be described later.
The ethtool program talks to device drivers by using the ioctl system call. The device drivers register a series of functions that run for the ethtool operations and the kernel provides the glue.
When an ioctl call is made from ethtool, the kernel finds the ethtool structure registered by the appropriate driver and executes the functions registered. The driver’s ethtool function implementation can do anything from change a simple software flag in the driver to adjusting how the actual NIC hardware works by writing register values to the device.
The e1000e driver registers its ethtool operations in e1000_probe by calling e1000e_set_ethtool_ops:

static int e1000_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
{
  /* ... */

  e1000e_set_ethtool_ops(netdev);

  /* ... */
}

All of the e1000e driver’s ethtool code can be found in the file drivers/net/ethernet/intel/e1000e/ethtool.c along with the e1000e_set_ethtool_ops function.

void e1000e_set_ethtool_ops(struct net_device *netdev)
{
  SET_ETHTOOL_OPS(netdev, &e1000_ethtool_ops);
}

Above that, you can find the e1000_ethtool_ops structure with the ethtool functions the e1000e driver supports set to the appropriate fields.

static const struct ethtool_ops e1000_ethtool_ops = {
  .get_settings           = e1000_get_settings,
  .set_settings           = e1000_set_settings,
  .get_drvinfo            = e1000_get_drvinfo,
  .get_regs_len           = e1000_get_regs_len,
  .get_regs               = e1000_get_regs,
  /* ... */
};

It is up to the individual drivers to determine which ethtool functions are relevant and which should be implemented. Not all drivers implement all ethtool functions, unfortunately.
One interesting ethtool function is get_ethtool_stats, which (if implemented) produces detailed statistics counters that are tracked either in software in the driver or via the device itself.

The monitoring section below will show how to use ethtool to access these detailed statistics.

When a data frame is written to RAM via DMA, how does the NIC tell the rest of the system that data is ready to be processed?
Traditionally, a NIC would generate an interrupt request (IRQ) indicating data had arrived. There are three common types of IRQs: MSI-X, MSI, and legacy IRQs. These will be touched upon shortly. A device generating an IRQ when data has been written to RAM via DMA is simple enough, but if large numbers of data frames arrive this can lead to a large number of IRQs being generated. The more IRQs that are generated, the less CPU time is available for higher level tasks like user processes.
The New API (NAPI) was created as a mechanism for reducing the number of IRQs generated by network devices on packet arrival. While NAPI reduces the number of IRQs, it cannot eliminate them completely.

NAPI differs from the legacy method of harvesting data in several important ways. NAPI allows a device driver to register a poll function that the NAPI subsystem will call to harvest data frames.
The intended use of NAPI in network device drivers is as follows:
  1. NAPI is enabled by the driver, but is in the off position initially.
  2. A packet arrives and is DMA’d to memory by the NIC.
  3. An IRQ is generated by the NIC which triggers the IRQ handler in the driver.
  4. The driver wakes up the NAPI subsystem using a softirq (more on these later). This will begin harvesting packets by calling the driver’s registered poll function in a separate thread of execution.
  5. The driver should disable further IRQs from the NIC. This is done to allow the NAPI subsystem to process packets without interruption from the device.
  6. Once there is no more work to do, the NAPI subsystem is disabled and IRQs from the device are re-enabled.
  7. The process starts back at step 2.
This method of gathering data frames has reduced overhead compared to the legacy method because many data frames can be consumed at a time without having to deal with processing each of them one IRQ at a time.
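The steps above can be sketched as a tiny userspace model. This is only an illustration under simplifying assumptions: the sketch_nic struct and both functions are hypothetical, a single counter stands in for the RX ring, and nothing here is real driver code. It shows why a burst of packets produces one IRQ rather than one IRQ per packet.

```c
/* Hypothetical model of a NIC and its NAPI state; not real kernel structures. */
struct sketch_nic {
    int ring;           /* packets waiting in the RX ring */
    int irq_enabled;    /* 1 while the device may raise RX IRQs */
    int napi_scheduled; /* 1 while the poll loop is active */
    int irqs_raised;    /* how many IRQs the device has raised so far */
    int packets_done;   /* packets harvested by the poll function */
};

/* Steps 2-5: a packet lands in the ring; an IRQ fires only if IRQs are on. */
static void sketch_packet_arrives(struct sketch_nic *nic)
{
    nic->ring++;
    if (nic->irq_enabled) {
        nic->irqs_raised++;
        nic->irq_enabled = 0;    /* driver masks further RX IRQs */
        nic->napi_scheduled = 1; /* and starts the poll loop */
    }
}

/* One poll call: harvest up to `weight` packets from the ring. */
static int sketch_napi_poll(struct sketch_nic *nic, int weight)
{
    int work = 0;

    if (!nic->napi_scheduled)
        return 0;
    while (work < weight && nic->ring > 0) {
        nic->ring--;
        work++;
    }
    nic->packets_done += work;
    if (work < weight) {         /* step 6: ring drained, go back to IRQs */
        nic->napi_scheduled = 0;
        nic->irq_enabled = 1;
    }
    return work;
}
```

With IRQs initially enabled, 100 back-to-back arrivals raise a single IRQ in this model. Two poll calls with a weight of 64 then harvest 64 and 36 packets, after which IRQs are re-enabled and the next arrival raises the second IRQ.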
The device driver implements a poll function and registers it with NAPI by calling netif_napi_add. When registering a NAPI poll function with netif_napi_add, the driver will also specify the weight. Most of the drivers hardcode a value of 64. This value and its meaning will be described in more detail below.
Typically, drivers register their NAPI poll functions during driver initialization.

NAPI initialization in the e1000e driver

The e1000e driver does this via a long call chain:
  1. e1000_probe calls e1000_sw_init.
  2. e1000_sw_init calls e1000e_set_interrupt_capability.
  3. e1000e_set_interrupt_capability sets up MSI-X, MSI, or legacy interrupts, depending on what the device supports.
This call trace results in a few high level things happening:
  1. If MSI-X is supported, it will be enabled with a call to pci_enable_msix.
  2. Various settings are computed and initialized; most notably the number of transmit and receive queues that the device and driver will use for sending and receiving packets.

static int e1000_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
{
  /* ... */

  /* initialize NAPI */
  netif_napi_add(netdev, &adapter->napi, e1000e_poll, 64);

  /* ... */
}

Bringing a network device up

Recall the net_device_ops structure we saw earlier which registered a set of functions for bringing the network device up, transmitting packets, setting the MAC address, etc.
When a network device is brought up (for example, with ifconfig eth0 up), the function attached to the ndo_open field of the net_device_ops structure is called.
The ndo_open function will typically do things like:
  1. Allocate RX and TX queue memory
  2. Enable NAPI
  3. Register an interrupt handler
  4. Enable hardware interrupts
  5. And more.

In the case of the e1000e driver, the function attached to the ndo_open field of the net_device_ops structure is called e1000_open.

Preparing to receive data from the network

Most NICs will use DMA to write data directly into RAM where the OS can retrieve the data for processing. The data structure most NICs use for this purpose resembles a queue built on a circular buffer (or a ring buffer).
In order to do this, the device driver must work with the OS to reserve a region of memory that the NIC hardware can use. Once this region is reserved, the hardware is informed of its location and incoming data will be written to RAM where it will later be picked up and processed by the networking subsystem.
This seems simple enough, but what if the packet rate was high enough that a single CPU was not able to properly process all incoming packets? The data structure is built on a fixed length region of memory, so incoming packets would be dropped.
This is where something known as Receive Side Scaling (RSS) or multiqueue can help.
Some devices have the ability to write incoming packets to several different regions of RAM simultaneously; each region is a separate queue. This allows the OS to use multiple CPUs to process incoming data in parallel, starting at the hardware level. This feature is not supported by all NICs.
The Intel 82571EB NIC does support multiple queues. We can see evidence of this in the e1000e driver. One of the first things the e1000e driver does when it is brought up is call a function named e1000e_setup_all_rx_resources. This function calls another function, e1000e_setup_rx_resources, once for each RX queue to arrange for DMA-able memory where the device will write incoming data.
If you are curious how exactly this works, please see the Linux kernel’s DMA API HOWTO.
It turns out the number and size of the RX queues can be tuned by using ethtool. Tuning these values can have a noticeable impact on the number of frames which are processed vs the number of frames which are dropped.
The NIC uses a hash function on the packet header fields (like source, destination, port, etc) to determine which RX queue the data should be directed to.
Some NICs let you adjust the weight of the RX queues, so you can send more traffic to specific queues.
Fewer NICs let you adjust this hash function itself. If you can adjust the hash function, you can send certain flows to specific RX queues for processing or even drop the packets at the hardware level, if desired.
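Here is a rough userspace sketch of the queue selection idea described above. Everything in it is an assumption for illustration: the sketch_* names are made up, and the FNV-1a hash is a simple stand-in (NICs commonly use a Toeplitz hash instead). The indirection table is what lets some queues receive more traffic than others: a queue that appears in more table slots gets a larger share of flows.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flow key built from the packet header fields. */
struct sketch_flow {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

/* FNV-1a over the flow fields: a stand-in, NOT the hash real NICs use. */
static uint32_t sketch_hash(const struct sketch_flow *f)
{
    uint32_t words[3] = {
        f->src_ip, f->dst_ip,
        ((uint32_t)f->src_port << 16) | f->dst_port
    };
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < 3; i++) {
        for (int b = 0; b < 4; b++) {
            h ^= (words[i] >> (8 * b)) & 0xffu;
            h *= 16777619u;
        }
    }
    return h;
}

/* Map a flow to an RX queue through an indirection table of queue numbers. */
static unsigned sketch_pick_queue(const struct sketch_flow *f,
                                  const uint8_t *table, size_t table_len)
{
    return table[sketch_hash(f) % table_len];
}
```

Because the hash depends only on the flow fields, every packet of a given flow lands on the same queue. Filling the table with a single queue number pins all traffic to that queue.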

Enable NAPI

When a network device is brought up, a driver will usually enable NAPI.
We saw earlier how drivers register poll functions with NAPI, but NAPI is not usually enabled until the device is brought up.
Enabling NAPI is relatively straightforward. A call to napi_enable will flip a bit in the struct napi_struct to indicate that it is now enabled. As mentioned above, while NAPI will be enabled it will be in the off position.

Register an interrupt handler

After enabling NAPI, the next step is to register an interrupt handler. There are different methods a device can use to signal an interrupt: MSI-X, MSI, and legacy interrupts. As such, the code differs from device to device depending on what the supported interrupt methods are for a particular piece of hardware.
The driver must determine which method is supported by the device and register the appropriate handler function that will execute when the interrupt is received.
Some drivers, like the e1000e driver, will try to register an interrupt handler with each method, falling back to the next untested method on failure.
MSI-X interrupts are the preferred method, especially for NICs that support multiple RX queues. This is because each RX queue can have its own hardware interrupt assigned, which can then be handled by a specific CPU (with irqbalance or by modifying /proc/irq/IRQ_NUMBER/smp_affinity). As we’ll see shortly, the CPU that handles the interrupt will be the CPU that processes the packet. In this way, arriving packets can be processed by separate CPUs from the hardware interrupt level up through the networking stack.
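The smp_affinity file takes a hexadecimal CPU bitmask, where bit N selects CPU N. As a small sketch, the hypothetical helper below (the name and the 32-CPU limit are assumptions made for brevity) formats the mask that would pin an IRQ to a single CPU.

```c
#include <stdio.h>

/*
 * Format the hex bitmask that selects a single CPU (bit N set for CPU N).
 * Only CPUs 0..31 are handled in this sketch.
 * Returns 0 on success, -1 on a bad CPU number or a too-small buffer.
 */
static int sketch_affinity_mask(unsigned cpu, char *buf, size_t len)
{
    int n;

    if (cpu > 31)
        return -1;
    n = snprintf(buf, len, "%x", 1u << cpu);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}
```

For CPU 3 this produces the string 8, and for CPU 4 it produces 10. Writing such a value into /proc/irq/IRQ_NUMBER/smp_affinity requires root, and a running irqbalance may later overwrite it.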
If MSI-X is unavailable, MSI still presents advantages over legacy interrupts and will be used by the driver if the device supports it. Read this useful wiki page for more information about MSI and MSI-X.
In the e1000e driver, e1000_request_msix sets up the MSI-X handlers, while e1000_intr_msi and e1000_intr are the interrupt handlers for the MSI and legacy interrupt modes, respectively.
You can find the code in the driver which attempts each interrupt method in drivers/net/ethernet/intel/e1000e/netdev.c:

static int e1000_request_irq(struct e1000_adapter *adapter)
{
  struct net_device *netdev = adapter->netdev;
  struct pci_dev *pdev = adapter->pdev;
  int err = 0;

  if (adapter->msix_entries) {
    err = e1000_request_msix(adapter);
    if (!err)
      goto request_done;
    /* fall back to MSI */

    /* ... */
  }

  /* ... */

  if (adapter->flags & FLAG_MSI_ENABLED) {
    err = request_irq(pdev->irq, e1000_intr_msi, 0,
          netdev->name, adapter);
    if (!err)
      goto request_done;

    /* fall back to legacy interrupts */

    /* ... */
  }

  err = request_irq(pdev->irq, e1000_intr, IRQF_SHARED,
        netdev->name, adapter);

  if (err)
    dev_err(&pdev->dev, "Error %d getting interrupt\n", err);

request_done:
  return err;
}

As you can see in the abbreviated code above, the driver first attempts to set an MSI-X interrupt handler with e1000_request_msix, falling back to MSI on failure. Next, request_irq is used to register e1000_intr_msi, the MSI interrupt handler. If this fails, the driver falls back to legacy interrupts. request_irq is used again to register the legacy interrupt handler e1000_intr.
And this is how the e1000e driver registers a function that will be executed when the NIC raises an interrupt signaling that data has arrived and is ready for processing.

Enable Interrupts

At this point, almost everything is set up. The only thing left is to enable interrupts from the NIC and wait for data to arrive. Enabling interrupts is hardware specific, but the e1000e driver does this in e1000_open by calling a helper function named e1000_irq_enable.
Interrupts are enabled for this device by writing to registers:

static void e1000_irq_enable(struct e1000_adapter *adapter)
{
  /* ... */

    ew32(EIAC_82574, adapter->eiac_mask & E1000_EIAC_MASK_82574);
    ew32(IMS, adapter->eiac_mask | E1000_IMS_OTHER | IMS_OTHER_MASK);

  /* ... */
}

The network device is now up

Drivers may do a few more things like start timers, work queues, or other hardware-specific setup. Once that is completed, the network device is up and ready for use.


Before examining the network stack, we’ll need to take a short detour to examine something in the Linux kernel called SoftIRQs.

What is a softirq?

The softirq system in the Linux kernel is a mechanism for executing code outside of the context of an interrupt handler implemented in a driver. This system is important because hardware interrupts may be disabled during all or part of the execution of an interrupt handler. The longer interrupts are disabled, the greater chance that events may be missed. So, it is important to defer any long running actions outside of the interrupt handler so that it can complete as quickly as possible and re-enable interrupts from the device.
There are other mechanisms that can be used for deferring work in the kernel, but for the purposes of the networking stack, we’ll be looking at softirqs.
The softirq system can be imagined as a series of kernel threads (one per CPU) that run handler functions which have been registered for different softirq events. If you’ve ever looked at top and seen ksoftirqd/0 in the list of kernel threads, you were looking at the softirq kernel thread running on CPU 0.
Kernel subsystems (like networking) can register a softirq handler by executing the open_softirq function. We’ll see later how the networking system registers its softirq handlers. For now, let’s learn a bit more about how softirqs work.


Since softirqs are so important for deferring the work of device drivers, you might imagine that the ksoftirqd process is spawned pretty early in the life cycle of the kernel and you’d be correct.
Looking at the code found in kernel/softirq.c reveals how the ksoftirqd system is initialized:

static struct smp_hotplug_thread softirq_threads = {
  .store              = &ksoftirqd,
  .thread_should_run  = ksoftirqd_should_run,
  .thread_fn          = run_ksoftirqd,
  .thread_comm        = "ksoftirqd/%u",
};

static __init int spawn_ksoftirqd(void)
{
  /* ... */

  return 0;
}

As you can see from the struct smp_hotplug_thread definition above, there are two function pointers being registered: ksoftirqd_should_run and run_ksoftirqd.
Both of these functions are called from kernel/smpboot.c as part of something which resembles an event loop.
The code in kernel/smpboot.c first calls ksoftirqd_should_run, which determines if there are any pending softirqs. If there are, run_ksoftirqd is executed. run_ksoftirqd does some minor bookkeeping before it calls __do_softirq.


The __do_softirq function does a few interesting things:
  • determines which softirq is pending
  • accounts softirq time for statistics purposes
  • increments softirq execution statistics
  • executes the softirq handler for the pending softirq (which was registered with a call to open_softirq).
So, when you look at graphs of CPU usage and see softirq or si you now know that this is measuring the amount of CPU usage happening in a deferred work context.
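The register, raise, and dispatch cycle described above can be modeled in a few lines of userspace C. This is a loose sketch under stated assumptions: the sketch_* names are hypothetical, only two softirq numbers exist, and the real __do_softirq does considerably more (time accounting, restart limits, handing off to ksoftirqd).

```c
#include <stdint.h>

enum { SKETCH_NET_TX, SKETCH_NET_RX, SKETCH_NR_SOFTIRQS };

typedef void (*sketch_handler_t)(void);

static sketch_handler_t sketch_vec[SKETCH_NR_SOFTIRQS]; /* registered handlers */
static uint32_t sketch_pending;                          /* one bit per softirq */
static unsigned sketch_runs[SKETCH_NR_SOFTIRQS];         /* execution counters */

/* Register a handler, in the spirit of open_softirq. */
static void sketch_open_softirq(int nr, sketch_handler_t fn)
{
    sketch_vec[nr] = fn;
}

/* Mark a softirq as pending; raising it twice still runs it once. */
static void sketch_raise_softirq(int nr)
{
    sketch_pending |= 1u << nr;
}

/* Run every pending handler once and return how many ran. */
static int sketch_do_softirq(void)
{
    uint32_t pending = sketch_pending;
    int ran = 0;

    sketch_pending = 0;
    for (int nr = 0; nr < SKETCH_NR_SOFTIRQS; nr++) {
        if (!(pending & (1u << nr)) || !sketch_vec[nr])
            continue;
        sketch_runs[nr]++;   /* execution statistics */
        sketch_vec[nr]();    /* the registered handler */
        ran++;
    }
    return ran;
}

/* Example handler standing in for net_rx_action. */
static int sketch_rx_calls;
static void sketch_net_rx_action(void)
{
    sketch_rx_calls++;
}
```

After registering sketch_net_rx_action for SKETCH_NET_RX and raising it, one call to sketch_do_softirq runs the handler and clears the pending bit. The per-softirq counters play a role similar to the numbers exposed in /proc/softirqs.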



The softirq system increments statistic counters which can be read from /proc/softirqs. Monitoring these statistics can give you a sense for the rate at which softirqs for various events are being generated.
cat /proc/softirqs
This file can give you an idea of how your network receive (NET_RX) processing is currently distributed across your CPUs. If it is distributed unevenly, you will see a larger count value for some CPUs than others. This is one indicator that you might be able to benefit from Receive Packet Steering / Receive Flow Steering described below. Be careful using just this file when monitoring your performance: during periods of high network activity you would expect to see the rate NET_RX increments increase, but this isn’t necessarily the case. It turns out that this is a bit nuanced, because there are additional tuning knobs in the network stack that can affect the rate at which NET_RX softirqs will fire, which we’ll see soon.
You should be aware of this, however, so that if you adjust the other tuning knobs you will know to examine /proc/softirqs and expect to see a change.
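If you would rather read these counters from a program than from cat, a sketch like the following can pull the per-CPU NET_RX values out of the file. The function names and buffer sizes are assumptions for illustration. The parser expects rows shaped like "NET_RX: 120 7", which is how the file looks on the systems I have seen.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse one /proc/softirqs row such as "  NET_RX:  120  7".
 * Stores up to `max` per-CPU counters in `out` and returns how many were
 * stored, or -1 if the row does not belong to softirq `name`.
 */
static int sketch_parse_row(const char *line, const char *name,
                            unsigned long *out, int max)
{
    const char *p = line;
    size_t len = strlen(name);
    int n = 0;

    while (*p == ' ' || *p == '\t')
        p++;
    if (strncmp(p, name, len) != 0 || p[len] != ':')
        return -1;
    p += len + 1;
    while (n < max) {
        char *end;
        unsigned long v = strtoul(p, &end, 10);

        if (end == p)   /* no more numbers on this row */
            break;
        out[n++] = v;
        p = end;
    }
    return n;
}

/* Print the NET_RX counters from the live file (Linux only). */
static int sketch_print_net_rx(void)
{
    char line[4096];             /* may truncate on very large machines */
    unsigned long counts[256];
    FILE *f = fopen("/proc/softirqs", "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof line, f)) {
        int n = sketch_parse_row(line, "NET_RX", counts, 256);

        for (int i = 0; i < n; i++)
            printf("CPU%d NET_RX=%lu\n", i, counts[i]);
    }
    fclose(f);
    return 0;
}
```

Sampling these values twice and subtracting gives a per-CPU rate, which makes an uneven distribution across CPUs easier to spot than the raw totals.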

Linux network device subsystem

Now that we’ve taken a look into how network drivers and softirqs work, let’s see how the Linux network device subsystem is initialized. Then, we can follow the path of a packet starting with its arrival.

Initialization of network device subsystem

The network device (netdev) subsystem is initialized in the function net_dev_init. Lots of interesting things happen in this initialization function.

Initialization of struct softnet_data structures

net_dev_init creates a set of struct softnet_data structures for each CPU on the system. These structures will hold pointers to several important things for processing network data:
  • List for NAPI structures to be registered to this CPU.
  • A backlog for data processing.
  • The processing weight.
  • The receive offload structure list.
  • Receive packet steering settings.
  • And more.
Each of these will be examined in greater detail later as we progress up the stack.

Initialization of softirq handlers

net_dev_init registers a transmit and receive softirq handler which will be used to process incoming or outgoing network data. The code for this is pretty straightforward:

static int __init net_dev_init(void)
{
  /* ... */

  open_softirq(NET_TX_SOFTIRQ, net_tx_action);
  open_softirq(NET_RX_SOFTIRQ, net_rx_action);

  /* ... */
}

We’ll see soon how the driver’s interrupt handler will “raise” (or trigger) the net_rx_action function registered to the NET_RX_SOFTIRQ softirq.

Data arrives

Assuming that the RX queue has enough available descriptors, the packet is written to RAM via DMA. The device then raises the interrupt that is assigned to it (or in the case of MSI-X, the interrupt tied to the rx queue the packet arrived on).

Interrupt handler

In general, the interrupt handler which runs when an interrupt is raised should try to defer as much processing as possible to happen outside the interrupt context. This is crucial because while an interrupt is being processed, other interrupts may be blocked.
Let’s take a look at the source for the MSI-X interrupt handler; it will really help illustrate the idea that the interrupt handler does as little work as possible.

static irqreturn_t e1000_msix_ring(int irq, void *data)
{
  struct net_device *netdev = data;

  /* Write the ITR value (e1000e_write_itr); call elided. */

  /* Schedule NAPI processing (napi_schedule); call elided. */

  return IRQ_HANDLED;
}

This interrupt handler is very short and performs 2 very quick operations before returning.
First, this function calls e1000e_write_itr which simply updates a hardware specific register. In this case, the register that is updated is one which is used to track the rate at which hardware interrupts are arriving.
This register is used in conjunction with a hardware feature called “Interrupt Throttling” (also called “Interrupt Coalescing”) which can be used to pace the delivery of interrupts to the CPU. We’ll see soon how ethtool provides a mechanism for adjusting the rate at which IRQs fire.
Secondly, napi_schedule is called which wakes up the NAPI processing loop if it was not already active. Note that the NAPI processing loop executes in a softirq; the NAPI processing loop does not execute from the interrupt handler. The interrupt handler simply causes it to start executing if it was not already.
The actual code showing exactly how this works is important; it will guide our understanding of how network data is processed on multi-CPU systems.

NAPI and napi_schedule

Let’s figure out how the napi_schedule call from the hardware interrupt handler works.
Remember, NAPI exists specifically to harvest network data without needing interrupts from the NIC to signal that data is ready for processing. As mentioned earlier, the NAPI poll loop is bootstrapped by receiving a hardware interrupt. In other words: NAPI is enabled, but off, until the first packet arrives at which point the NIC raises an IRQ and NAPI is started. There are a few other cases, as we’ll see soon, where NAPI can be disabled and will need a hardware interrupt to be raised before it will be started again.
The NAPI poll loop is started when the interrupt handler in the driver calls napi_schedule. napi_schedule is actually just a wrapper function defined in a header file which calls down to __napi_schedule.

/**
 * __napi_schedule - schedule for receive
 * @n: entry to schedule
 *
 * The entry's receive function will be scheduled to run
 */
void __napi_schedule(struct napi_struct *n)
{
  unsigned long flags;

  local_irq_save(flags);
  ____napi_schedule(&__get_cpu_var(softnet_data), n);
  local_irq_restore(flags);
}

This code is using __get_cpu_var to get the softnet_data structure that is registered to the current CPU. This softnet_data structure and the struct napi_struct structure handed up from the driver are passed into ____napi_schedule. Wow, that’s a lot of underscores ;)
Let’s take a look at ____napi_schedule, from net/core/dev.c:

/* Called with irq disabled */
static inline void ____napi_schedule(struct softnet_data *sd,
                                     struct napi_struct *napi)
{
  list_add_tail(&napi->poll_list, &sd->poll_list);
  __raise_softirq_irqoff(NET_RX_SOFTIRQ);
}

This code does two important things:
  1. The struct napi_struct handed up from the device driver’s interrupt handler code is added to the poll_list attached to the softnet_data structure associated with the current CPU.
  2. __raise_softirq_irqoff is used to “raise” (or trigger) a NET_RX_SOFTIRQ softirq. This will cause the net_rx_action registered during the network device subsystem initialization to be executed, if it’s not currently being executed.
The softirq handler function net_rx_action will call the NAPI poll function to harvest packets.
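To see the per-CPU aspect of this more clearly, here is a small userspace model of the two steps above. The sketch_* structs and the fixed CPU count are assumptions, a hand-rolled singly linked list stands in for the kernel's list_head, and a plain flag stands in for raising NET_RX_SOFTIRQ.

```c
#include <stddef.h>

#define SKETCH_NR_CPUS 4

/* Hypothetical stand-in for struct napi_struct. */
struct sketch_napi {
    int scheduled;              /* the idea behind NAPI_STATE_SCHED */
    struct sketch_napi *next;
};

/* Hypothetical stand-in for the per-CPU struct softnet_data. */
struct sketch_softnet {
    struct sketch_napi *head, *tail;  /* this CPU's poll list */
    int rx_softirq_pending;
};

static struct sketch_softnet sketch_softnet_data[SKETCH_NR_CPUS];

/* Called from the IRQ handler running on `cpu`. Returns 1 if newly queued. */
static int sketch_napi_schedule(int cpu, struct sketch_napi *n)
{
    struct sketch_softnet *sd = &sketch_softnet_data[cpu];

    if (n->scheduled)
        return 0;               /* poll loop already active: nothing to do */
    n->scheduled = 1;
    n->next = NULL;
    if (sd->tail)
        sd->tail->next = n;     /* append to this CPU's poll list */
    else
        sd->head = n;
    sd->tail = n;
    sd->rx_softirq_pending = 1; /* stands in for raising NET_RX_SOFTIRQ */
    return 1;
}
```

Scheduling a NAPI structure from CPU 2 touches only CPU 2's list and pending flag, which is why the CPU that takes the hardware interrupt is also the one that ends up harvesting the packets.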

A note about CPU and network data processing

Note that all the code we’ve seen so far to defer work from a hardware interrupt handler to a softirq has been using structures associated with the current CPU.
While the driver’s IRQ handler does very little work itself, the softirq handler will execute on the same CPU as the driver’s IRQ handler.
This is why setting the CPU a particular IRQ will be handled by is important: that CPU will be used not only to execute the interrupt handler in the driver, but the same CPU will also be used when harvesting packets in a softirq via NAPI.
As we’ll see later, things like Receive Packet Steering can distribute some of this work to other CPUs further up the network stack.

Network data processing begins

Once the softirq code determines that a softirq is pending, begins processing, and executes net_rx_action, network data processing begins.
Let’s take a look at portions of the net_rx_action processing loop to understand how it works, which pieces are tunable, and what can be monitored.

net_rx_action processing loop

net_rx_action begins the processing of packets from the memory the packets were DMA’d into by the device.
The function iterates through the list of NAPI structures that are queued for the current CPU, dequeuing each structure, and operating on it.
The processing loop bounds the amount of work and execution time that can be consumed by the registered NAPI poll functions. It does this in two ways:
  1. By keeping track of a work budget (which can be adjusted), and
  2. Checking the elapsed time

  while (!list_empty(&sd->poll_list)) {
    struct napi_struct *n;
    int work, weight;

    /* If softirq window is exhausted then punt.
     * Allow this to run for 2 jiffies since which will allow
     * an average latency of 1.5/HZ.
     */
    if (unlikely(budget <= 0 || time_after_eq(jiffies, time_limit)))
      goto softnet_break;

    /* ... */
  }

This is how the kernel prevents packet processing from consuming the entire CPU. The budget above is the total available budget that will be spent among each of the available NAPI structures registered to this CPU.
This is another reason why multiqueue NICs should have the IRQ affinity carefully tuned. Recall that the CPU which handles the IRQ from the device will be the CPU where the softirq handler will execute and, as a result, will also be the CPU where the above loop and budget computation runs.
Systems with multiple NICs each with multiple queues can end up in a situation where multiple NAPI structs are registered to the same CPU. Data processing for all NAPI structs on the same CPU spend from the same budget.
If you don’t have enough CPUs to distribute your NIC’s IRQs, you can consider increasing the net_rx_action budget to allow for more packet processing for each CPU. Increasing the budget will increase CPU usage (specifically sitime or si in top or other programs), but should reduce latency as data will be processed more promptly.
Note: the CPU will still be bounded by a time limit of 2 jiffies, regardless of the assigned budget.

NAPI poll function and weight

Recall that network device drivers use netif_napi_add to register a poll function. As we saw earlier in this post, the e1000e driver has a piece of code like this:
  /* initialize NAPI */
  netif_napi_add(netdev, &adapter->napi, e1000e_poll, 64);
This registers a NAPI structure with a hardcoded weight of 64. We’ll see now how this is used in the net_rx_action processing loop.
weight = n->weight;

work = 0;
if (test_bit(NAPI_STATE_SCHED, &n->state)) {
        work = n->poll(n, weight);
}

WARN_ON_ONCE(work > weight);

budget -= work;
This code obtains the weight which was registered to the NAPI struct (64 in the above driver code) and passes it into the poll function which was also registered to the NAPI struct (e1000e_poll in the above code).
The poll function returns the number of data frames that were processed. This amount is saved above as work, which is then subtracted from the overall budget.
So, assuming:
  1. You are using a weight of 64 from your driver (all drivers were hardcoded with this value in Linux 5.6.0), and
  2. You have your budget set to the default of 300
Your system would stop processing data when either:
  1. The e1000e_poll function was called at most 5 times (fewer if there is no data to process, as we’ll see next), OR
  2. At least 2 jiffies of time have elapsed.
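The arithmetic behind "at most 5 times" can be checked with a small userspace model of the budget loop. This is a sketch under simplifying assumptions: both functions are hypothetical, a single counter stands in for the frames waiting in the ring, only one NAPI structure is polled, and the 2 jiffy time limit is not modeled.

```c
#include <assert.h>

/* Hypothetical poll function: harvest up to `weight` frames from a counter. */
static int sketch_drain(int *frames_waiting, int weight)
{
    int work = *frames_waiting < weight ? *frames_waiting : weight;

    *frames_waiting -= work;
    return work;
}

/*
 * Model of the budget check: keep polling until the budget is spent or
 * there is nothing left to harvest. Returns the number of poll calls made.
 */
static int sketch_net_rx_action(int *frames_waiting, int budget, int weight)
{
    int polls = 0;

    while (*frames_waiting > 0) {
        int work;

        if (budget <= 0)        /* softirq window exhausted */
            break;
        work = sketch_drain(frames_waiting, weight);
        assert(work <= weight); /* plays the role of WARN_ON_ONCE */
        budget -= work;
        polls++;
    }
    return polls;
}
```

With 1000 frames waiting, a budget of 300, and a weight of 64, the budget goes 236, 172, 108, 44, -20, so the loop stops after 5 polls having harvested 320 frames. With only 100 frames waiting, it stops after 2 polls because there is nothing left to process.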

The NAPI / network device driver contract

One important piece of information about the contract between the NAPI subsystem and device drivers which has not been mentioned yet is the set of requirements around shutting down NAPI.
This part of the contract is as follows:
  • If a driver’s poll function consumes its entire weight (which is hardcoded to 64) it must NOT modify NAPI state. The net_rx_action loop will take over.
  • If a driver’s poll function does NOT consume its entire weight, it must disable NAPI. NAPI will be re-enabled next time an IRQ is received and the driver’s IRQ handler calls napi_schedule.
We’ll see how net_rx_action deals with the first part of that contract now. Then, when the poll function is examined, we’ll see how the second part of that contract is handled.
