Tags: web-services, tcp, finagle

What does it mean for TCP connections to churn?


In the context of web services, I've seen the term "TCP connection churn" used. Specifically, Twitter's Finagle has ways to avoid it happening. How does it happen? What does it mean?


Solution

  • There might be multiple uses for this term, but I've always seen it used in cases where many TCP connections are being made in a very short space of time, causing performance issues on the client and potentially the server as well.

    This often occurs when client code automatically reconnects on any TCP failure. If the failure happens to be a connection failure before the connection is even established (or very early in the protocol exchange), the client can go into a near-busy loop, constantly making connections. This causes performance issues on the client side: firstly, there is a process in a very busy loop sucking up CPU cycles, and secondly, each connection attempt consumes a client-side ephemeral port. If this happens fast enough, the port numbers can wrap around once they hit the maximum (as a port is only a 16-bit number, this certainly isn't impossible).
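    To make the failure mode concrete, here is a minimal Python sketch of that naive pattern. `connect_to_server` and `handle_session` are hypothetical stand-ins for whatever client library and session logic are actually in use:

    ```python
    import socket

    def connect_to_server():
        # Hypothetical stand-in for the real client's connection logic.
        return socket.create_connection(("server.example.com", 8080))

    while True:
        try:
            sock = connect_to_server()
            handle_session(sock)   # hypothetical: runs until the connection drops
        except OSError:
            continue               # reconnect instantly - this is the churn
    ```

    If `connect_to_server` fails immediately (e.g. the server is refusing connections), this loop spins flat out, burning CPU and one ephemeral port per iteration.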

    While writing robust code is a worthy aim, this simple "automatic retry" approach is a little too naive. You can see similar problems in other contexts - e.g. a parent process continually restarting a child process which immediately crashes. One common mechanism to avoid it is an increasing back-off. So, when the first connection fails, you reconnect immediately. If it fails again within a short time (e.g. 30 seconds), you wait, say, 2 seconds before reconnecting. If it fails again within 30 seconds, you wait 4 seconds, and so on. Read the Wikipedia article on exponential backoff (or this blog post might be more appropriate for this application) for more background on this technique.
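    A minimal sketch of that scheme, reusing the hypothetical `connect_to_server` and `handle_session` from above, with the 30-second window and 2-second starting delay that the paragraph uses as example values:

    ```python
    import time

    INITIAL_DELAY = 2.0   # first back-off step, in seconds
    RESET_WINDOW = 30.0   # a connection surviving this long counts as healthy

    delay = 0.0
    while True:
        started = time.monotonic()
        try:
            sock = connect_to_server()   # hypothetical, as above
            handle_session(sock)         # runs until the connection drops
        except OSError:
            pass
        if time.monotonic() - started >= RESET_WINDOW:
            delay = 0.0   # healthy run: the next failure retries immediately
        else:
            # quick repeated failure: wait 2 s, then double each time
            delay = INITIAL_DELAY if delay == 0.0 else delay * 2
        time.sleep(delay)
    ```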

    This approach has the advantage that it doesn't overwhelm the client or the server, but it still lets the client recover without manual intervention (which is especially crucial for software running unattended on a server, for example, or in large clusters).

    In cases where recovery time is critical, simple rate-limiting of TCP connection creation is also quite possible - perhaps no more than one connection per second. If there are many clients per server, however, this simpler approach can still leave the server swamped by the load of accepting and then closing connections at a high rate.
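    As a sketch, rate-limiting can be as simple as enforcing a minimum interval between attempts (again using the hypothetical `connect_to_server` from above):

    ```python
    import time

    MIN_INTERVAL = 1.0    # at most one connection attempt per second
    last_attempt = 0.0

    def rate_limited_connect():
        global last_attempt
        wait = MIN_INTERVAL - (time.monotonic() - last_attempt)
        if wait > 0:
            time.sleep(wait)
        last_attempt = time.monotonic()
        return connect_to_server()   # hypothetical, as above
    ```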

    One thing to note if you plan to employ exponential backoff - I suggest imposing a maximum wait time, or you might find that prolonged failures leave a client taking too long to recover once the server end does start accepting connections again. I would suggest something like 5 minutes as a reasonable maximum in most circumstances, but of course it depends on the application.
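    Folding that cap into the back-off sketch above is a one-line change (`MAX_DELAY` is an assumed name, set here to the 5 minutes suggested):

    ```python
    MAX_DELAY = 300.0   # 5 minutes

    # In the back-off loop above, clamp the doubling step:
    delay = INITIAL_DELAY if delay == 0.0 else min(delay * 2, MAX_DELAY)
    ```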