Search code examples
phpxmlubuntutcptimeout

TCP Retransmissions causing timeouts during XML conversation between Ubuntu client (curl) and server


Im trying to ascertain, and solve the reason behind the ACK FIN retransmissions outlined in the screenshot below. I am running a ubuntu 18 apache server with PHP, which is communicating with a XML service provider via PHP curl (PHP 7.3). During the XML conversations the server in question seems to time out at random intervals. This has a effect on the xml service of either timing out or returning an incomplete xml resultset. Another developer has informed me they had similar issues, and resorted to a PHP loop to retry failed requests several times until the service responded correctly (from their own client dev boxes). This points to the theory something may be wrong on the server side, however I cannot entertain this until i've excluded all possibilities my side.(what if the other devs used the same client OS as me? etc)

I have tested extensively from postman on my local windows machine, however I cannot get timeouts/errors to happen here. Not sure whether postman has some kinda error correcting mechanism, or whether my windows itself seems to jive better against the TCPIP stack of the server machine.

What I have tried so far: - knocked the connection down from https to http so i can capture with wireshark on the Ubuntu client (https as the cause excluded at the same time) - changed MTU from 9001 (ec2 instance) to 1500 and 1492 respectively, problem still there, service server's MTU appears to be 1500 - enabled keepalives on curl, no effect - attempted different timeouts and connection timeouts in curl settings, no effect - attempted the same loop in php to retry the request. added a flag to sever the tcp connection completely if retrying within curl, no effect, retried requests will still time out sometimes, other times they return what is expected in xml. Seemingly random. From 20 requests maybe 2 may fail.

notes: Seems like after the client sends a post xml request, server responds with an ACK, but never sends the status 200, as I captured this when the request failed. This seems to result in repeated FIN ACK retransmissions from client, some error correction seems to happen here however this does not bubble up to the XML layer to present a complete request, instead CURL issues a timeout whilst it is waiting for the response. In wireshark I could see the incomplete answer, ie about half of xml reconstructed at line 42. My only hunches are that maybe the server is a windows box which may somehow be incompatible with the ubuntu's tcp ip stack, or that it is simply a bug on the server which may not be fixed no matter what I do, aside from repeated requests.

Any ideas? I am not an TCPIP expert so FINS and ACKS are only generally comprehended :) what would you try next?

wireshark screenshot


Solution

  • There is a lot that can be decoded from the network capture. First note that frames 46 and 47 (the start of a subsequent TCP connection since we don't see the start of the connection in question) negotiate a maximum segment size (MSS) of 1460. However, frame 53 has 4380 bytes (3 times MSS) and frame 55 has 7300 bytes (5 times MSS). This is most likely because the network capture was taken on the client host and some form of receive offloading is enabled on the NIC or the driver. In general it is best to get a network capture off the network (e.g. by spanning a switch port and getting a capture from the NIC on the spanned port in promiscuous mode). So keep in mind that what we see in the capture is not exactly what was on the wire.

    We see the request under consideration being sent in frame 34 and then we see it acknowledged (with sequence 510) 11ms later.

    Then nothing happens for roughly 20 seconds. This, at the TCP level, is perfectly normal. As best we can tell there is no unacknowledged data transmitted (thus TCP won't close the connection and TCP user timeout isn't in play even if enabled), so this connection could sit idle like this for quite some time until TCP keepalive kicks in (if enabled).

    Then, in frame 36, the client sends a FIN with sequence 510. At a TCP level this indicates that the client is done sending, which would be caused by the application calling close() or shutdown() on the socket. Since this is very nearly an even 20 seconds, it sure feels like an application level timeout, although it does not appear to be the default timeout for PHP Curl, but then you have clearly been messing around with various settings.

    We would now expect the server to ACK the transmitted FIN and to send whatever data it needs to send, and then when the server application calls close() or shutdown() a FIN will be sent from the server.

    However, the server does not ACK the client's FIN, and the client TCP stack re-transmits the FIN 5 times. You can see the re-transmission time doubling after each FIN, with the 6th re-transmission expected at around 31.38 seconds, assuming it hasn't given up. On Linux, I believe this is controlled by tcp_orphan_retries which defaults to 8, so my best guess is that it hasn't given up.

    Finally, in frame 42, the server speaks, well before the next FIN re-transmission is expected. It ACKs sequence 511, indicating it received one or more of the FINs. And it contains a full MSS payload, which from a protocol level is just fine.

    So now the question is, why didn't the server TCP stack ACK the FIN until now? And is the full MSS payload a re-transmission, or is it the first time it is being sent? Let's hold of on guessing for now, as there is additional helpful information in the capture.

    The client immediately responds with a RST. This is either because TCP did give up on the server and the connection no longer exists, or because the application called close() on the socket so the TCP stack can't deliver the data to the client application and all hope of an orderly shutdown is gone. I'm guessing it is the latter.

    Then, surprisingly, in frame 44, we get a FIN from the server with a sequence of 16061. WAT???. Frame 42 sent the first 1460 bytes of the response, so there are 16060-1460 bytes of data amounting to 10 MSS missing. There is clearly packet loss here (which is why the WireSharks annotated the packet with "Previous segment not captured").

    I don't think that the FIN is a re-transmission, because there is plenty of unacknowledged data to be re-transmitted (11 MSS worth), and that FINs are generally transmitted promptly on close() even with outstanding unacked data. So I'm guessing this FIN was sent when the server called close() on the socket.

    So if the FIN is not a retransmission, I'm also guessing that the first 1460 bytes received in frame 42 just 10ms earlier is probably not as well. I'm thinking that the server did its thinking, wrote 11 MSS worth of data to the socket and closed the socket, causing 11 packets with payload (admittedly all are unacked, but a contemporary implementation/configuration of TCP "slow start"would allow for this) followed by the FIN, but that only two of those made it through due to some form of packet loss. I'm further guessing that the TCP stack was probably ACKing the multiple FINs but was also subject to packet loss.

    So my suggestion is to increase the timeout for PHP Curl. My best guess is that the server took more than 20s to compute its reply, and packet loss and re-transmission timeouts will only cause further delays.