Test Setup: Linux-Server-1 Port-A <==> Port 1 DPDK-Server-2 Port 2 <==> Port B Linux-Server-2
Steps Followed:
Network devices using DPDK-compatible driver
============================================
0000:03:00.0 '82599ES 10-Gigabit SFI/SFP+ Network Connection 10fb' drv=uio_pci_generic unused=ixgbe,vfio-pci
0000:03:00.1 '82599ES 10-Gigabit SFI/SFP+ Network Connection 10fb' drv=uio_pci_generic unused=ixgbe,vfio-pci
Network devices using kernel driver
===================================
0000:05:00.0 'I210 Gigabit Network Connection 1533' if=enp5s0 drv=igb unused=vfio-pci,uio_pci_generic *Active*
0000:06:00.0 'I210 Gigabit Network Connection 1533' if=enp6s0 drv=igb unused=vfio-pci,uio_pci_generic
Issue: Port 2 of the DPDK server is reported DOWN by app_ports_check_link.
[EDIT] Running a DPDK example, I am able to get packets sent to DPDK port 1 and port 2.
Log for eventdev:
EAL: PCI device 0000:03:00.0 on NUMA socket 0
EAL: probe driver: 8086:10fb net_ixgbe
EAL: PCI device 0000:03:00.1 on NUMA socket 0
EAL: probe driver: 8086:10fb net_ixgbe
EAL: PCI device 0000:05:00.0 on NUMA socket 0
EAL: probe driver: 8086:1533 net_e1000_igb
EAL: PCI device 0000:06:00.0 on NUMA socket 0
EAL: probe driver: 8086:1533 net_e1000_igb
USER1: Creating the mbuf pool ...
USER1: Initializing NIC port 0 ...
USER1: Initializing NIC port 1 ...
USER1: Port 0 (10 Gbps) UP
USER1: Port 1 (0 Gbps) DOWN
PANIC in app_ports_check_link():
Some NIC ports are DOWN
8: [./build/pipeline(_start+0x2a) [0x558dc37c1d8a]]
7: [/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f318e9f5b97]]
6: [./build/pipeline(main+0x7a) [0x558dc37c1fa4]]
5: [./build/pipeline(_Z8app_initv+0x18) [0x558dc37c2940]]
4: [./build/pipeline(+0x8c909) [0x558dc37c2909]]
3: [./build/pipeline(+0x8c677) [0x558dc37c2677]]
2: [./build/pipeline(__rte_panic+0xc5) [0x558dc37b4a90]]
1: [./build/pipeline(rte_dump_stack+0x2e) [0x558dc385954e]]
fish: “sudo ./build/pipeline” terminated by signal SIGABRT (Abort)
Code
static void
app_ports_check_link(void)
{
    uint32_t all_ports_up, i;

    all_ports_up = 1;

    for (i = 0; i < app.n_ports; i++) {
        struct rte_eth_link link;
        uint16_t port;

        port = app.ports[i];
        memset(&link, 0, sizeof(link));
        rte_eth_link_get_nowait(port, &link);
        RTE_LOG(INFO, USER1, "Port %u (%u Gbps) %s\n",
            port,
            link.link_speed / 1000,
            link.link_status ? "UP" : "DOWN");

        if (link.link_status == ETH_LINK_DOWN)
            all_ports_up = 0;
    }

    if (all_ports_up == 0)
        rte_panic("Some NIC ports are DOWN\n");
}
static void
app_init_ports(void)
{
    uint32_t i;
    struct rte_eth_conf port_conf = app_port_conf_init();
    struct rte_eth_rxconf rx_conf = app_rx_conf_init();
    struct rte_eth_txconf tx_conf = app_tx_conf_init();
    (void)tx_conf;

    /* Init NIC ports, then start the ports */
    for (i = 0; i < app.n_ports; i++) {
        uint16_t port;
        int ret;

        port = app.ports[i];
        RTE_LOG(INFO, USER1, "Initializing NIC port %u ...\n", port);

        /* Init port */
        ret = rte_eth_dev_configure(port, 1, 1, &port_conf);
        if (ret < 0)
            rte_panic("Cannot init NIC port %u (%s)\n",
                port, rte_strerror(ret));
        rte_eth_promiscuous_enable(port);

        /* Init RX queues */
        ret = rte_eth_rx_queue_setup(
            port,
            0,
            app.port_rx_ring_size,
            rte_eth_dev_socket_id(port),
            &rx_conf,
            app.pool);
        if (ret < 0)
            rte_panic("Cannot init RX for port %u (%d)\n",
                (uint32_t) port, ret);

        /* Init TX queues */
        ret = rte_eth_tx_queue_setup(
            port,
            0,
            app.port_tx_ring_size,
            rte_eth_dev_socket_id(port),
            NULL);
        if (ret < 0)
            rte_panic("Cannot init TX for port %u (%d)\n",
                (uint32_t) port, ret);

        /* Start port */
        ret = rte_eth_dev_start(port);
        if (ret < 0)
            rte_panic("Cannot start port %u (%d)\n", port, ret);
    }

    app_ports_check_link();
}
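For comparison, the DPDK sample applications (for example l2fwd's check_all_ports_link_status()) do not panic on the first DOWN reading: they poll the link for a bounded time so auto-negotiation has a chance to finish after rte_eth_dev_start(). A minimal sketch of that pattern; app_wait_for_link and the two constants are illustrative names, not part of the original code:

#include <string.h>

#include <rte_cycles.h>
#include <rte_ethdev.h>

#define APP_LINK_POLL_MS    100   /* poll interval */
#define APP_LINK_POLL_MAX    90   /* give up after roughly 9 s */

static int
app_wait_for_link(uint16_t port)
{
    struct rte_eth_link link;
    int i;

    for (i = 0; i < APP_LINK_POLL_MAX; i++) {
        memset(&link, 0, sizeof(link));
        rte_eth_link_get_nowait(port, &link);
        if (link.link_status == ETH_LINK_UP)
            return 0;                     /* link came up */
        rte_delay_ms(APP_LINK_POLL_MS);   /* give autoneg time to finish */
    }
    return -1;                            /* still down after the timeout */
}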
[EDIT] 2020/7/1 Update
Running $RTE_SDK/examples/skeleton/build/basicfwd -l 1, I got the following:
EAL: Detected 24 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: No free hugepages reported in hugepages-1048576kB
EAL: Probing VFIO support...
EAL: PCI device 0000:03:00.0 on NUMA socket 0
EAL: probe driver: 8086:10fb net_ixgbe
EAL: PCI device 0000:03:00.1 on NUMA socket 0
EAL: probe driver: 8086:10fb net_ixgbe
EAL: PCI device 0000:05:00.0 on NUMA socket 0
EAL: probe driver: 8086:1533 net_e1000_igb
EAL: PCI device 0000:06:00.0 on NUMA socket 0
EAL: probe driver: 8086:1533 net_e1000_igb
Port 0 MAC: 9c 69 b4 60 90 26
Port 1 MAC: 9c 69 b4 60 90 27
Core 1 forwarding packets. [Ctrl+C to quit]
recv pkts num: 1, port: 0
================= Ether header ===============
srcmac: 9C:69:B4:60:90:17
dstmac: 33:33:00:00:00:16
ethertype: 34525
This packet is IPv6
================= Ether header ===============
srcmac: 9C:69:B4:60:90:17
dstmac: 33:33:00:00:00:16
ethertype: 34525
This packet is IPv6
send 1 pkts, port: 1
recv pkts num: 1, port: 1
================= Ether header ===============
srcmac: 9C:69:B4:60:90:1C
dstmac: 33:33:00:00:00:16
ethertype: 34525
This packet is IPv6
================= Ether header ===============
srcmac: 9C:69:B4:60:90:1C
dstmac: 33:33:00:00:00:16
ethertype: 34525
This packet is IPv6
send 1 pkts, port: 0
recv pkts num: 1, port: 1
================= Ether header ===============
srcmac: 9C:69:B4:60:90:1C
dstmac: 33:33:00:00:00:16
ethertype: 34525
This packet is IPv6
================= Ether header ===============
srcmac: 9C:69:B4:60:90:1C
dstmac: 33:33:00:00:00:16
ethertype: 34525
This packet is IPv6
send 1 pkts, port: 0
...
It seems that there is no problem with the two ports. Strange!
[EDIT] 2020/7/2 Update
After replacing rte_eth_link_get_nowait with rte_eth_link_get, the program works normally.
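For clarity, the change inside app_ports_check_link() amounts to the following (sketch of the relevant lines only):

        memset(&link, 0, sizeof(link));
        /* Was: rte_eth_link_get_nowait(port, &link);
         * rte_eth_link_get() lets the driver wait until the link readout
         * has settled, so a port that is still auto-negotiating is not
         * reported DOWN prematurely. */
        rte_eth_link_get(port, &link);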
Following @Vipin Varghese's suggestion, I have checked the ports' settings with ethtool DEVNAME and ethtool -a DEVNAME:
Settings for ens1f1:
Supported ports: [ FIBRE ]
Supported link modes: 10000baseT/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: No
Supported FEC modes: Not reported
Advertised link modes: 10000baseT/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: No
Advertised FEC modes: Not reported
Speed: 10000Mb/s
Duplex: Full
Port: FIBRE
PHYAD: 0
Transceiver: internal
Auto-negotiation: off
Supports Wake-on: d
Wake-on: d
Current message level: 0x00000007 (7)
drv probe link
Link detected: yes
Settings for ens1f0:
Supported ports: [ FIBRE ]
Supported link modes: 1000baseT/Full
10000baseT/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: Yes
Supported FEC modes: Not reported
Advertised link modes: 1000baseT/Full
10000baseT/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Advertised FEC modes: Not reported
Speed: 10000Mb/s
Duplex: Full
Port: FIBRE
PHYAD: 0
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: d
Wake-on: d
Current message level: 0x00000007 (7)
drv probe link
Link detected: yes
Autonegotiate: off
RX: on
TX: on
But I'm really confused: what is the difference between rte_eth_link_get_nowait and rte_eth_link_get (per the DPDK doc)? Why can autoneg make them behave differently?

Explanation:
Checking with ethtool while the application is down is not a trusted way. Depending upon the DPDK version, rte_eth_dev_close or rte_eal_cleanup would not have put the NIC back into the right state.

a. The Server-3 port might be auto-negotiating with DPDK port-1, leading rte_eth_link_get_nowait to report the link as down (the right API to invoke is rte_eth_link_get).
b. The Server-3 port might be manually configured in a non-full-duplex, non-10G mode.

The right way to debug is to force no auto-neg, 10G, full-duplex on the peer port, and to run ethtool -t for port-B on Server-3 to cross-check the results too.

Note: this will help you identify whether it is the Server-3 port's driver/firmware that acts differently with auto-neg, since sending and receiving packets is successful with examples/skeleton run as $RTE_SDK/examples/skeleton/build/basicfwd -l 1.
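For reference, the DPDK-side counterpart of forcing no auto-neg at 10G is the link_speeds field of struct rte_eth_conf. A minimal sketch, assuming the pre-20.11 macro names this code base already uses; app_port_conf_fixed_10g is an illustrative helper, not part of the original application:

#include <string.h>

#include <rte_ethdev.h>

static struct rte_eth_conf
app_port_conf_fixed_10g(void)
{
    struct rte_eth_conf conf;

    memset(&conf, 0, sizeof(conf));
    /* ETH_LINK_SPEED_FIXED disables auto-negotiation on the DPDK port;
     * the peer (port-B on Server-3) must then be forced to the same
     * 10G full-duplex mode with ethtool, or the link will not come up. */
    conf.link_speeds = ETH_LINK_SPEED_10G | ETH_LINK_SPEED_FIXED;
    /* ... remaining rxmode/txmode fields as in app_port_conf_init() ... */
    return conf;
}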
[EDIT-1] Based on the update in the comments, it looks like rte_eth_link_get_nowait is the fast (non-waiting) approach, and the right API to use here is rte_eth_link_get. Requested an online debug session with the author.
[EDIT-2] Based on the comments, rte_eth_link_get has done the desired job. As I recollect, rte_eth_link_get waits for the actual readout from the physical device registers, while rte_eth_link_get_nowait returns without waiting; hence the right values are populated by rte_eth_link_get.
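A rough way to see that difference right after rte_eth_dev_start() is to time both calls. This sketch assumes the older void-returning API used elsewhere in the question; compare_link_calls is an illustrative name:

#include <stdio.h>
#include <stdint.h>

#include <rte_cycles.h>
#include <rte_ethdev.h>

static void
compare_link_calls(uint16_t port)
{
    struct rte_eth_link link;
    uint64_t hz = rte_get_timer_hz();
    uint64_t t0;

    t0 = rte_get_timer_cycles();
    rte_eth_link_get_nowait(port, &link);   /* returns immediately */
    printf("nowait: %.3f s, status=%u\n",
        (double)(rte_get_timer_cycles() - t0) / hz,
        (unsigned)link.link_status);

    t0 = rte_get_timer_cycles();
    rte_eth_link_get(port, &link);          /* may block while the PHY settles */
    printf("get:    %.3f s, status=%u\n",
        (double)(rte_get_timer_cycles() - t0) / hz,
        (unsigned)link.link_status);
}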