
TCP packet loss occurs occasionally when using DPDK 19.11 with an i40e NIC


I am using an XL710 (i40e) NIC with DPDK 19.11. I found that the NIC occasionally loses packets when the TSO feature is enabled.

The details are here: https://github.com/JiangHeng12138/dpdk-issue/issues/1

I guess the packet loss is caused by the i40e NIC driver, but I don't know how to debug the i40e driver code. Could you please suggest an effective way to do this?


Solution

  • The problem statement is occasional TCP packet loss with DPDK 19.11 on an i40e NIC. The first step is to isolate whether the loss originates at the client (peer system) or at the server (DPDK DUT). To debug the DPDK server side, evaluate both the RX and TX paths. The DPDK tool dpdk-procinfo retrieves port statistics, which can be used to analyze the issue.

    Diagnose the issue:

    1. Run the application (DPDK primary process) in terminal 1 and reproduce the issue.
    2. In terminal 2, run dpdk-procinfo -- --stats. See the dpdk-procinfo documentation for more details.
    3. Check the RX-errors counter. It shows whether faulty received packets were dropped at the PMD level.
    4. Check the RX-nombuf counter. It counts RX mbuf allocation failures, that is, packets the NIC could not DMA into host memory because no buffer was available.
    5. Check the TX-errors counter. It shows whether handing packet descriptors (DMA descriptors) to the NIC failed.
    6. Also check the hardware NIC statistics with dpdk-procinfo -- --xstats for any error or drop counters that increase.

    [Figure: sample capture of the stats and xstats counters for the NIC under test]
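Since the counters only matter when they increase while the issue is being reproduced, it can help to compare two snapshots instead of reading absolute values. The following is a minimal sketch, assuming the counters have already been copied by hand into dictionaries. The key names mirror the labels used in the steps above, the function name counter_deltas is hypothetical, and the numbers are made-up illustration values, not measurements.

```python
# Assumed counter labels, copied from the diagnosis steps above.
WATCHED = ("RX-errors", "RX-nombuf", "TX-errors")


def counter_deltas(before, after, watched=WATCHED):
    """Return {name: increase} for every watched counter that grew."""
    deltas = {}
    for name in watched:
        grown = after.get(name, 0) - before.get(name, 0)
        if grown > 0:
            deltas[name] = grown
    return deltas


# Illustrative values only: one snapshot before and one after reproducing the loss.
before = {"RX-errors": 0, "RX-nombuf": 12, "TX-errors": 0}
after = {"RX-errors": 0, "RX-nombuf": 40, "TX-errors": 3}
print(counter_deltas(before, after))  # {'RX-nombuf': 28, 'TX-errors': 3}
```

An empty result means none of the watched counters moved during the test window.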

    Note:

    1. "tx_good_packets" is the number of packets sent by the DPDK NIC. If the number of packets the application tried to send equals "tx_good_packets", no packets were dropped on the transmit side.

    2. "rx_missed_errors" indicates packet loss at the receiver: packets are arriving faster than the current CPU can process them. Either increase the CPU frequency or use additional cores to distribute the traffic.

    If none of these counters increases and no errors are found, then the issue is most likely on the peer (non-DPDK client) side.
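The interpretation rules in the notes can be sketched as a small decision helper. This is only an illustration under stated assumptions: locate_loss is a hypothetical name, attempted_tx is assumed to be the application's own count of packets it tried to send, and xstats_delta is assumed to hold counter increases over one test run, keyed by xstats names. It flags only a shortfall in "tx_good_packets", so a count equal to or above the attempted number is treated as no transmit drop.

```python
def locate_loss(attempted_tx, xstats_delta):
    """Apply the two notes above to counter increases from one test run."""
    findings = []

    # Note 1: fewer good TX packets than attempted means drops on the transmit side.
    sent = xstats_delta.get("tx_good_packets", 0)
    if sent < attempted_tx:
        findings.append("transmit side: %d packets not sent" % (attempted_tx - sent))

    # Note 2: missed RX packets mean the receiver could not keep up.
    missed = xstats_delta.get("rx_missed_errors", 0)
    if missed > 0:
        findings.append("receive side: %d packets missed" % missed)

    # Neither condition holds: nothing points at the DPDK side.
    if not findings:
        findings.append("no local drops: check the peer")
    return findings


# Illustrative values only.
print(locate_loss(1000, {"tx_good_packets": 1000, "rx_missed_errors": 25}))
# ['receive side: 25 packets missed']
```

Returning a list rather than a single verdict keeps both findings visible when transmit and receive drops occur in the same run.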