Tags: linux-kernel, ebpf, xdp-bpf

Low throughput with XDP_TX in comparison with XDP_DROP/REDIRECT


I have developed an XDP program that filters packets based on some specific rules and then either drops them (XDP_DROP) or redirects them (xdp_redirect_map) to another interface. This program could easily process a synthetic load of ~11 Mpps (that's all my traffic generator is capable of) on just four CPU cores.
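Roughly, the program has this structure (a simplified sketch, not my real rules; the map name tx_port and the IPv4-only check are just placeholders for illustration):

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Devmap holding the ifindex of the egress interface for redirects. */
struct {
    __uint(type, BPF_MAP_TYPE_DEVMAP);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u32);
} tx_port SEC(".maps");

SEC("xdp")
int xdp_filter_redirect(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;

    if ((void *)(eth + 1) > data_end)
        return XDP_DROP;

    /* Placeholder for the real filtering rules: drop everything
     * that is not IPv4. */
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_DROP;

    /* Send everything else out through the devmap entry at index 0. */
    return bpf_redirect_map(&tx_port, 0, 0);
}

char _license[] SEC("license") = "GPL";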

Now I've changed that program to use XDP_TX to send the packets back out on the interface they were received on, instead of redirecting them to another interface. Unfortunately, this simple change caused a big drop in throughput: now it barely handles ~4 Mpps.

I don't understand what could be causing this or how to debug it further, which is why I'm asking here.

My minimal test setup to reproduce the issue:

  • Two machines with Intel x520 SFP+ NICs directly connected to each other; both NICs are configured to have as many "combined" queues as the machine has CPU cores.
  • Machine 1 runs pktgen using a sample script from the Linux sources: ./pktgen_sample05_flow_per_thread.sh -i ens3 -s 64 -d 1.2.3.4 -t 4 -c 0 -v -m MACHINE2_MAC (4 threads, because this was the config that resulted in the highest generated Mpps, even though the machine has way more than 4 cores).
  • Machine 2 runs a simple program that drops (or reflects) all packets and counts the pps; a sketch of it follows this list. In that program, I've replaced the XDP_DROP return code with XDP_TX. Whether or not I swap the src/dst MAC addresses before reflecting the packet never made a difference in throughput, so I'm leaving that out here.
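The reflector on Machine 2 is essentially the following (again a simplified sketch of what the question describes; the per-CPU counter map name pkt_cnt is illustrative), with the return value switched between XDP_DROP and XDP_TX for the two measurements:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>

/* Per-CPU packet counter, read from user space to compute pps. */
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} pkt_cnt SEC(".maps");

SEC("xdp")
int xdp_reflect(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    __u32 key = 0;
    __u64 *count;

    count = bpf_map_lookup_elem(&pkt_cnt, &key);
    if (count)
        (*count)++;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_DROP;

    /* Optional: swap src/dst MAC so the reflected frame is addressed back
     * to the sender (made no difference in throughput in my tests). */
    unsigned char tmp[ETH_ALEN];
    __builtin_memcpy(tmp, eth->h_source, ETH_ALEN);
    __builtin_memcpy(eth->h_source, eth->h_dest, ETH_ALEN);
    __builtin_memcpy(eth->h_dest, tmp, ETH_ALEN);

    return XDP_TX;    /* or XDP_DROP for the baseline measurement */
}

char _license[] SEC("license") = "GPL";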

When running the program with XDP_DROP, 4 cores on Machine 2 are slightly loaded with ksoftirqd threads while dropping around ~11 Mpps. That only 4 cores are loaded makes sense, given that pktgen sends out 4 different packets (flows) that fill only 4 RX queues because of how the hashing in the NIC works.

But when running the program with XDP_TX, one of the cores is ~100% busy with ksoftirqd and only ~4 Mpps are processed. I'm not sure why that happens.

Do you have an idea what might be causing this throughput drop and CPU usage increase?

Edit: Here are some more details about the configuration of Machine 2:

# ethtool -g ens2f0
Ring parameters for ens2f0:
Pre-set maximums:
RX:             4096
RX Mini:        n/a
RX Jumbo:       n/a
TX:             4096
Current hardware settings:
RX:             512   # changing rx/tx to 4096 didn't help
RX Mini:        n/a
RX Jumbo:       n/a
TX:             512

# ethtool -l ens2f0
Channel parameters for ens2f0:
Pre-set maximums:
RX:             n/a
TX:             n/a
Other:          1
Combined:       63
Current hardware settings:
RX:             n/a
TX:             n/a
Other:          1
Combined:       32

# ethtool -x ens2f0
RX flow hash indirection table for ens2f0 with 32 RX ring(s):
    0:      0     1     2     3     4     5     6     7
    8:      8     9    10    11    12    13    14    15
   16:      0     1     2     3     4     5     6     7
   24:      8     9    10    11    12    13    14    15
   32:      0     1     2     3     4     5     6     7
   40:      8     9    10    11    12    13    14    15
   48:      0     1     2     3     4     5     6     7
   56:      8     9    10    11    12    13    14    15
   64:      0     1     2     3     4     5     6     7
   72:      8     9    10    11    12    13    14    15
   80:      0     1     2     3     4     5     6     7
   88:      8     9    10    11    12    13    14    15
   96:      0     1     2     3     4     5     6     7
  104:      8     9    10    11    12    13    14    15
  112:      0     1     2     3     4     5     6     7
  120:      8     9    10    11    12    13    14    15
RSS hash key:
d7:81:b1:8c:68:05:a9:eb:f4:24:86:f6:28:14:7e:f5:49:4e:29:ce:c7:2e:47:a0:08:f1:e9:31:b3:e5:45:a6:c1:30:52:37:e9:98:2d:c1
RSS hash function:
    toeplitz: on
    xor: off
    crc32: off

# uname -a
Linux test-2 5.8.0-44-generic #50-Ubuntu SMP Tue Feb 9 06:29:41 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Edit 2: I've also tried MoonGen as a packet generator now and flooded Machine 2 with 10 Mpps across 100 different packet variations (flows). Now the traffic is distributed much better between the cores when dropping all these packets, with minimal CPU load. But XDP_TX still can't keep up and loads a single core to 100% while processing only ~3 Mpps.


Solution

  • I've now upgraded the kernel of Machine 2 to 5.12.0-rc3 and the issue disappeared. Looks like this was a kernel issue.

    If somebody knows more about this or has a changelog regarding this, please let me know.