c++ linux tcp

Linux socket data sits in the send queue until receive timeout


When sending data as a TCP client on Linux, the write call occasionally succeeds, but the data appears to remain in the send buffer and is only transmitted after a receive timeout. I base this conclusion on tcpdump packet captures and on the SendQ sizes I collected during the lag period. The socket is in blocking mode for sending, and the TCP_NODELAY flag is enabled. The server side is an old-fashioned SLIP device.

[Image: data stuck in send queue]
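For reference, the SendQ figures and the TCP_NODELAY setting can also be read from inside the process. The following is only a minimal sketch, assuming a Linux system whose headers define SIOCOUTQ and SIOCOUTQNSD; the names SendQueueSnapshot and querySendQueue are hypothetical and not part of eRPC.

```cpp
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/sockios.h>  // SIOCOUTQ, SIOCOUTQNSD
#include <netinet/in.h>     // IPPROTO_TCP
#include <netinet/tcp.h>    // TCP_NODELAY
#include <cassert>

// Hypothetical helper type (not part of eRPC). Every field stays at -1
// when the corresponding query fails.
struct SendQueueSnapshot
{
    int queued = -1;   // SIOCOUTQ: bytes in the send queue (not sent + not acked)
    int notSent = -1;  // SIOCOUTQNSD: bytes in the send queue that were not sent yet
    int noDelay = -1;  // 1 if TCP_NODELAY is set, 0 if it is not
};

// Query the send queue state of a TCP socket. Call it right after write()
// returns and again during the lag period to compare the values.
SendQueueSnapshot querySendQueue(int fd)
{
    SendQueueSnapshot snapshot;

    int value = 0;
    if (ioctl(fd, SIOCOUTQ, &value) == 0)
    {
        snapshot.queued = value;
    }
    if (ioctl(fd, SIOCOUTQNSD, &value) == 0)
    {
        snapshot.notSent = value;
    }

    int flag = 0;
    socklen_t length = sizeof(flag);
    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, &length) == 0)
    {
        snapshot.noDelay = (flag != 0) ? 1 : 0;
    }

    return snapshot;
}
```

If notSent stays above zero while tcpdump shows nothing leaving the host, the data is still held by the TCP layer. If notSent drops to zero but queued does not, TCP has handed the data down the stack and it is either waiting below TCP or has not been acknowledged by the peer.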

I would like to know what means can be used to troubleshoot this problem.

Below is the send function.

erpc_status_t TCPTransport::underlyingSend(const uint8_t *data, uint32_t size, void *arg)
{
    int socket = (NULL == arg) ? m_socket : *(int *)arg;
    if (socket < 0)
    {
        return kErpcStatus_InvalidArgument;
    }

    // Loop until all data is sent.
    while (size)
    {
#ifndef WIN32
        ssize_t result = write(socket, data, size);
#else
        int result = send_Data(socket, (char *)data, size, 0);
#endif
        if (result >= 0)
        {
            size -= result;
            data += result;
        }
        else
        {
            if (errno == EPIPE)
            {
                // Server closed.
                //close();
                TCP_DEBUG_ERR("underlyingSend() connect closed.");
                return kErpcStatus_ConnectionClosed;
            }
            TCP_DEBUG_ERR("underlyingSend() send failed.");
            return kErpcStatus_SendFailed;
        }
    }

    return kErpcStatus_Success;
}

erpc_tcp_transport.cpp: https://github.com/EmbeddedRPC/erpc/blob/develop/erpc_c/transports/erpc_tcp_transport.cpp

Some additional details: I ran a communication test on the connections that showed the problem. With no other programs running on the system, the test completed 2,000,000 communication attempts without any abnormality. The issue only appears when the whole system is running. I therefore still suspect that it is caused by the system or the network being busy. We also captured system resource utilization, and it indicates that CPU and memory resources should be sufficient.


Solution

  • We finally figured it out: the lockless qdisc has a concurrency problem. It is a kernel bug in Linux 5.10. The link below describes the fix.

    Lockless qdisc has concurrent problem.
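As a first check on another machine, one can see whether it runs a 5.10 kernel at all. This is only a rough sketch: the helper names are hypothetical, and the release string alone does not tell whether a distribution has backported the fix, so treat a match as a hint to look closer rather than as proof.

```cpp
#include <sys/utsname.h>
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: extract "major.minor" from a release string such as
// "5.10.0-21-amd64". Returns false if the string does not start with two
// dot-separated numbers.
bool parseKernelRelease(const std::string &release, int &major, int &minor)
{
    return std::sscanf(release.c_str(), "%d.%d", &major, &minor) == 2;
}

// True if the release string belongs to the 5.10 series named in the answer.
bool isKernel510(const std::string &release)
{
    int major = 0;
    int minor = 0;
    return parseKernelRelease(release, major, minor) && major == 5 && minor == 10;
}

// Release string of the running kernel, or an empty string if uname() fails.
std::string runningKernelRelease()
{
    utsname info{};
    return (uname(&info) == 0) ? std::string(info.release) : std::string();
}
```

Calling isKernel510(runningKernelRelease()) gives the answer for the current machine.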