I have developed an XDP program that filters packets based on some specific rules and then either drops them (XDP_DROP) or redirects them (xdp_redirect_map) to another interface. This program was able to process a synthetic load of ~11Mpps (that's all my traffic generator is capable of) on just four CPU cores.
Now I've changed that program to use XDP_TX to send the packets out on the interface they were received on instead of redirecting them to another interface. Unfortunately, this simple change caused a big drop in throughput: it now barely handles ~4Mpps.
I don't understand what could cause this or how to debug it further, which is why I'm asking here.
My minimal test setup to reproduce the issue:
./pktgen_sample05_flow_per_thread.sh -i ens3 -s 64 -d 1.2.3.4 -t 4 -c 0 -v -m MACHINE2_MAC
(4 threads, because this was the config that resulted in the highest generated Mpps, even though the machine has far more than 4 cores.)
On Machine 2 I run a minimal program that just replaces the XDP_DROP return code with XDP_TX. Whether I swap the src/dst MAC addresses before reflecting the packet never made a difference in throughput, so I'm leaving that out here.
When running the program with XDP_DROP, 4 cores on Machine 2 are slightly loaded with ksoftirqd threads while dropping around ~11Mpps. That only 4 cores are loaded makes sense, given that pktgen sends out 4 different packets that fill only 4 rx queues because of how the hashing in the NIC works.
But when running the program with XDP_TX, one core is ~100% busy with ksoftirqd and only ~4Mpps are processed. I'm not sure why that happens.
Do you have an idea what might be causing this throughput drop and CPU usage increase?
Edit: Here are some more details about the configuration of Machine 2:
# ethtool -g ens2f0
Ring parameters for ens2f0:
Pre-set maximums:
RX: 4096
RX Mini: n/a
RX Jumbo: n/a
TX: 4096
Current hardware settings:
RX: 512 # changing rx/tx to 4096 didn't help
RX Mini: n/a
RX Jumbo: n/a
TX: 512
# ethtool -l ens2f0
Channel parameters for ens2f0:
Pre-set maximums:
RX: n/a
TX: n/a
Other: 1
Combined: 63
Current hardware settings:
RX: n/a
TX: n/a
Other: 1
Combined: 32
# ethtool -x ens2f0
RX flow hash indirection table for ens2f0 with 32 RX ring(s):
0: 0 1 2 3 4 5 6 7
8: 8 9 10 11 12 13 14 15
16: 0 1 2 3 4 5 6 7
24: 8 9 10 11 12 13 14 15
32: 0 1 2 3 4 5 6 7
40: 8 9 10 11 12 13 14 15
48: 0 1 2 3 4 5 6 7
56: 8 9 10 11 12 13 14 15
64: 0 1 2 3 4 5 6 7
72: 8 9 10 11 12 13 14 15
80: 0 1 2 3 4 5 6 7
88: 8 9 10 11 12 13 14 15
96: 0 1 2 3 4 5 6 7
104: 8 9 10 11 12 13 14 15
112: 0 1 2 3 4 5 6 7
120: 8 9 10 11 12 13 14 15
RSS hash key:
d7:81:b1:8c:68:05:a9:eb:f4:24:86:f6:28:14:7e:f5:49:4e:29:ce:c7:2e:47:a0:08:f1:e9:31:b3:e5:45:a6:c1:30:52:37:e9:98:2d:c1
RSS hash function:
toeplitz: on
xor: off
crc32: off
# uname -a
Linux test-2 5.8.0-44-generic #50-Ubuntu SMP Tue Feb 9 06:29:41 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Edit 2: I've also tried MoonGen as a packet generator now and flooded Machine 2 with 10Mpps and 100 different packet variations (flows). Now the traffic is distributed much better between the cores when dropping all these packets, with minimal CPU load. But XDP_TX still can't keep up and loads a single core to 100% while processing ~3Mpps.
I've now upgraded the kernel of Machine 2 to 5.12.0-rc3 and the issue disappeared. Looks like this was a kernel issue.
If somebody knows more about this or has a changelog entry regarding it, please let me know.