Tags: performance, caching, io, x86, dpdk

Why does DPDK + Mellanox ConnectX-5 process 128B packets much faster than other packet sizes when running an I/O-intensive application?


For my measurements there are two machines, one as the client node (Haswell) and the other as the server node (Skylake), and both nodes have a Mellanox ConnectX-5 NIC. The client sends packets to the server at a high rate (Gpps), and a simple application -- L2 forwarding -- runs on the server node with 4096 RX descriptors. I have sent packets of many sizes (64B, 128B, 256B, 512B, 1024B, 1500B), but I get an interesting result: when I send 128B packets, the latency (both LAT99 and LAT-AVG) is much better than for the other packet sizes.
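A run of this kind can be sketched with DPDK's testpmd in MAC-forwarding mode (the binary name, core list and PCIe address below are illustrative, not the exact command used):

    sudo ./dpdk-testpmd -l 1-2 -n 4 -a 0000:18:00.1 -- \
        --forward-mode=mac --rxq=1 --txq=1 --nb-cores=1 \
        --rxd=4096 --txd=4096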

My measurement results are below:

packet size   THROUGHPUT (bit/s)   PPS             LAT99     LAT-AVG
64B           14772199568.1        20983238.0228   372.75    333.28
128B          22698652659.5        18666655.1476   51.25     32.92
256B          27318589720          12195798.9821   494.75    471.065822332
512B          49867099486          11629454.1712   491.5     455.98037273
1024B         52259987845.5        6233300.07701   894.75    842.567256665
1500B         51650191179.9        4236400.1952    1298.5    1231.18194373

Some settings and configuration (the output of "sudo mlxconfig -d 0000:18:00.1 q" was posted as screenshots).

The server node (Skylake) has DDIO enabled, so the packets are delivered directly into the L3 cache. The latency gap between 333.28 and 32.92 is similar to the gap between the L1 cache and the L3 cache, so I guess it might be due to L1 prefetching: the L1 cache prefetches better when receiving 128B packets than with other packet sizes.
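One way to test this guess would be to compare L1 and last-level cache miss counters on the forwarding core while traffic is running, once per packet size; the core number below is a placeholder, and the exact event names can vary per CPU and kernel:

    # sample cache load/miss counters on the forwarding core for 10 seconds
    sudo perf stat -C 2 \
        -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses \
        -- sleep 10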

My questions:
1. Is my guess correct?
2. Why is it faster to process 128B packets? Is there a specific L1 prefetch strategy that can explain this result?
3. If my guess is wrong, what is causing this phenomenon?


Solution

  • @xuxingchen there are multiple questions here and several clarifications are required to address them, so let me clarify step by step.

    1. The current setup is listed as Mellanox ConnectX-5, but mlxconfig states it is a DPU. A DPU has an internal engine, and its latency will differ from a foundational Mellanox NIC such as ConnectX-4, ConnectX-5 or ConnectX-6.
    2. The PCIe read size is recommended to be updated to 1024.
    3. The node is mentioned as SKYLAKE, which has PCIe Gen 3.0, but mlxconfig reports a PCIe Gen 4.0 connection.
    4. CQE compression is set to balanced, but the recommended setting (even for vector mode) is aggressive.
    5. For DDIO to work, the PCIe device (firmware) needs TPH (TLP Processing Hints) activated, so that the steering tag can be populated from user space down to the NIC firmware. For Intel NICs there is code in the DPDK PMD to achieve this.
    6. In the case of Mellanox, I do not find the TPH enabling code in the PMD. Hence I have to speculate that if the DPU NIC supports DDIO, it might be through driver tag steering via MSI-X interrupts pinned to a CPU core. For this one needs to disable the default IRQ affinity of the current NIC and pin all of its interrupts to specific cores (other than the DPDK cores); a quick way to check the TPH capability and pin the interrupts is sketched right after this list.
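    A minimal sketch of those two checks (the PCIe address 0000:18:00.1 and target core 0 are placeholders):

    # does the PCIe function advertise the TLP Processing Hints capability?
    sudo lspci -s 0000:18:00.1 -vvv | grep -i "Transaction Processing Hints"

    # pin every mlx5 interrupt to core 0 (a non-DPDK core in this example)
    for irq in $(grep mlx5 /proc/interrupts | awk '{print $1}' | tr -d ':'); do
        echo 0 | sudo tee /proc/irq/$irq/smp_affinity_list > /dev/null
    done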

    With these points in mind, my recommendations for the right settings (only for the foundational NICs CX-5 and CX-6, not the DPU, since I have not tested it) are:

    # stop services that re-balance interrupts or steal cycles
    systemctl stop irqbalance.service
    systemctl disable irqbalance.service
    systemctl stop wpa_supplicant
    systemctl disable wpa_supplicant
    # pin the NIC interrupts to cores that are not used by DPDK
    ./set_irq_affinity_cpulist.sh [non dpdk cores] [desired NIC]
    # disable SR-IOV and apply the Mellanox tuning profile
    mlxconfig -d [pcie device id] set SRIOV_EN=0
    mlnx_tune -r
    # enlarge the queues and disable flow control
    ifconfig [NIC] txqueuelen 20000
    ethtool -G [NIC] rx 8192 tx 8192
    ethtool -A [NIC] rx off tx off
    # firmware-level tuning: zero-touch tuning, CQE compression, PCIe write ordering
    mlxconfig -d [pcie address] set ZERO_TOUCH_TUNING_ENABLE=1
    mlxconfig -d [pcie address] set CQE_COMPRESSION=1
    mlxconfig -d [pcie address] set PCI_WR_ORDERING=1
    

    With the above settings, plus the settings from the performance report for the MLX-5 foundational NIC, I am able to achieve the following result on AMD EPYC:

    Performance with vector mode with MLX-5

    [EDIT-1] Based on the comment, there is an incorrect assumption that the CPU is the bottleneck for the lower packets-per-second per queue. To prove it is not a CPU or platform issue, the same test is run with multiple Mellanox ports on 1 CPU core (that is, 1 RX queue each for 2 ports):


    Note: with other vendors' NICs (Intel and Broadcom) one can easily achieve 68 Mpps and 55 Mpps respectively with 1 port and 1 RX queue.

    Using multiple RX queues on the same CPU core, we can achieve higher Mpps (showcasing that the individual RX queue is the limiting factor on MLX).
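    As a rough illustration (the PCIe address, core list and queue counts below are placeholders, not the exact command behind the screenshots), polling several RX queues of one port from a single forwarding core can be set up with testpmd like this:

    sudo ./dpdk-testpmd -l 1-2 -n 4 -a 0000:18:00.1 -- \
        --forward-mode=mac --rxq=4 --txq=4 --nb-cores=1 \
        --rxd=4096 --txd=4096

    RSS then spreads the incoming flows across the 4 RX queues, while the single forwarding core polls all of them.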