linux, sockets, network-programming, tcp, bind

bind() fails after using SO_REUSEADDR option for wildcard socket in state TIME_WAIT


I am running my server application on Linux. My server uses a socket bound to the address *::<some_specific_port> (where * means a wildcard IP address).

My program can terminate normally (the socket is closed with close()) or it can crash or be killed by some external signal.

I want to restart my application as soon as possible, without caring about TCP reliability (I take care of that at some higher level). When I restart my server it uses the exact same address (*::<same_exact_port>), but the bind() syscall fails with errno=EADDRINUSE, which means the address is already in use.

I looked it up and saw that the old socket is in the TIME_WAIT state. After reading a little I found out about the address-reuse issue with TCP on Linux. But as I said before, in my case I don't really care about reliability; all I care about is restarting my program (which always uses a wildcard IP and the same port) as soon as possible.

I tried using SO_REUSEADDR and setting the linger time to 0, but the problem keeps happening. I have seen the SO_REUSEPORT option, which seems to solve my problem, but I would prefer to avoid it as much as I can (for security reasons).

I read about the net.ipv4.tcp_tw_reuse option on Linux, but the documentation is vague and unclear. My machine is configured with net.ipv4.tcp_tw_reuse=0, and I was wondering if enabling this flag would help.

Or maybe the flag is unrelated and I am missing something else.

I have seen the post How do SO_REUSEADDR and SO_REUSEPORT differ?, which has a great answer on this topic, but I still don't understand whether, on Linux, I can bind the exact same address (wildcard and same port) while the old socket is in TIME_WAIT state and the new socket is set with SO_REUSEADDR.


Solution

  • Setting the linger time to zero causes your socket not to wait for unsent data to be sent (all unsent data is discarded at once); however, it is only guaranteed to avoid the TIME_WAIT state if the other end has already closed its write pipe.
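A minimal sketch of what "linger time zero" means in code, assuming a TCP socket fd you already created (the function name set_linger_zero is made up for illustration):

```c
#include <string.h>
#include <sys/socket.h>

/* Configures a zero linger time on fd: a later close() discards any
 * unsent data at once instead of trying to deliver it. Whether this
 * also turns the close into an RST is system dependent (see below). */
int set_linger_zero(int fd) {
    struct linger lg;
    memset(&lg, 0, sizeof lg);
    lg.l_onoff  = 1;   /* enable lingering ...           */
    lg.l_linger = 0;   /* ... with a zero-second timeout */
    return setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof lg);
}
```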

    A socket can be seen as two pipes. A read pipe and a write pipe. Your read pipe is connected to the write pipe of the other side and your write pipe is connected to the read pipe of the other side. When you open a socket, both pipes are opened and when you close a socket, both pipes are closed. However, you can close individual pipes using the shutdown() call.
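The two-pipe model can be seen in action with a small, self-contained demo. This sketch uses an AF_UNIX socketpair rather than TCP (so it needs no network), but the shutdown() semantics it shows are the same: closing only the write pipe still delivers earlier data and then signals EOF to the peer:

```c
#include <sys/socket.h>
#include <unistd.h>

/* Demonstrates a half-close: after shutdown(SHUT_WR) on one end, data
 * sent before the shutdown is still delivered, and the peer's next
 * read sees EOF (recv returns 0). Returns 0 on success, -1 on error. */
int half_close_demo(void) {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    const char msg[] = "bye";
    if (send(sv[0], msg, sizeof msg, 0) < 0)
        return -1;
    shutdown(sv[0], SHUT_WR);        /* close only our write pipe */

    char buf[16];
    ssize_t n   = recv(sv[1], buf, sizeof buf, 0); /* data still arrives */
    ssize_t eof = recv(sv[1], buf, sizeof buf, 0); /* then EOF (0)       */

    close(sv[0]);
    close(sv[1]);
    return (n == (ssize_t)sizeof msg && eof == 0) ? 0 : -1;
}
```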

    When you use shutdown() to close your write pipe (SHUT_WR or SHUT_RDWR), your socket may end up in TIME_WAIT, even if the linger time is zero. And when you call close() on a socket, it implicitly closes both pipes, unless they are already closed; if it did close the write pipe, the socket will have to wait, even if it dropped any pending data from the send buffer.

    If the other side calls close() first, or at least calls shutdown() with SHUT_WR, and only after that you call close(), socket closing may only be delayed by the linger time, to ensure unsent data is sent and data in flight is acknowledged. After all data has been sent and acknowledged, or after the linger timeout has been hit, whichever happens first, the socket closes at once and does not remain in TIME_WAIT, as it was the other side that initiated the disconnect.

    On some systems setting the linger time to zero causes sockets to be closed by reset (RST) instead of a normal close (FIN, ACK), in which case all unsent data is discarded and the socket will not go into TIME_WAIT either, as that is not required after a reset, not even if you closed the socket first. But whether a linger time of zero triggers a reset is system dependent; you cannot rely on it, as no standard defines this behavior. It can also vary depending on whether your sockets are blocking or non-blocking and whether shutdown() has been called prior to close() or not.

    However, if your app crashes or is killed in the middle of a TCP transmission, both pipes are still open and the system has to close the socket on your behalf. In that case some systems will simply ignore any linger configuration and fall back to the standard behavior you would also get if linger were disabled completely. This means you may end up in TIME_WAIT even with a linger time of zero, on systems that would otherwise support closing a socket by reset. Again, this is system specific, but it has already bitten me in the past on macOS systems.

    As for SO_REUSEADDR, this setting does not necessarily allow reuse across different processes for sockets in TIME_WAIT state. If process X has opened socketA and now socketA is in TIME_WAIT state, then process X can for sure bind socketB to the same address and port as socketA, if, and only if, it uses SO_REUSEADDR (on Linux, both the waiting socket and the new one require that flag; on BSD only the new one requires it). But process Y may not be able to bind a socket to the same address and port as socketA, while socketA is still in TIME_WAIT state, for security reasons.
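The key detail above is that SO_REUSEADDR must be set before bind(), and, on Linux, must have been set on the old socket too. A sketch of the listener setup, assuming IPv4 for brevity (for the question's *:: wildcard you would use AF_INET6 with in6addr_any instead; make_listener is a made-up name):

```c
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Creates a wildcard TCP listener with SO_REUSEADDR set before bind().
 * Returns the listening fd, or -1 on error. */
int make_listener(uint16_t port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    int yes = 1;  /* on Linux the old socket now in TIME_WAIT must have
                     had this flag set too for the rebind to succeed   */
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof yes) != 0) {
        close(fd);
        return -1;
    }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY); /* the wildcard "*" */
    addr.sin_port        = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) != 0 ||
        listen(fd, SOMAXCONN) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```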

    Again, this is system specific, and Linux does not always behave as BSD does or POSIX expects. It may also depend on the port number you are using. Sometimes this limitation only applies to ports below 1024 (most people testing this behavior forget to test both ports above and below 1024). Some systems additionally restrict reuse to the same user (IIRC Windows has such restrictions).

    So what could you possibly do to work around the issue? SO_REUSEPORT is an option, as it has no restriction on using exactly the same address+port combination in different processes: it was explicitly introduced to Linux to allow port reuse by different processes, for the purpose of load balancing between multiple server processes.

    Another possibility is to catch any termination of your program (as far as that is possible) and then somehow make the other side close the socket first. As long as the other side initiates the close operation, you will never end up in TIME_WAIT. Of course, pulling this off is tricky, and maybe impossible inside a signal handler that is called because your app has crashed, as what you can do in a signal handler is very limited. Usually you work around this by handling signals outside of the handler, but if it was a crash signal, it's not clear which calls you can still safely perform and which ones you cannot, even if you handle signals on a different thread than the one that just crashed. Also note that you cannot catch SIGKILL, and even when killed like this, the system will cleanly close your sockets.

    A nicer programmatic workaround: use two processes. A parent process does all the socket management and spawns a child process that implements the actual server. If the child process is killed, the parent process still owns all sockets, can still close them cleanly, can re-bind to the same address and port using SO_REUSEADDR, and can even spawn a new child process, so your server keeps running.
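A hypothetical sketch of that parent/child split (run_child_once and demo_serve are made-up names; a real supervisor would loop and respawn): the parent keeps ownership of listen_fd, runs the server logic in a child, and regains control when the child exits or crashes, free to close the fd cleanly or fork a fresh worker.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs serve(listen_fd) in a child process and waits for it to exit.
 * The fd stays open in the parent no matter how the child dies.
 * Returns the child's wait status, or -1 if fork() failed. */
int run_child_once(int listen_fd, void (*serve)(int)) {
    pid_t pid = fork();
    if (pid == 0) {               /* child: run the actual server */
        serve(listen_fd);
        _exit(0);
    }
    if (pid < 0)
        return -1;                /* fork failed */
    int status = 0;
    waitpid(pid, &status, 0);     /* parent: listen_fd still open here */
    return status;
}

/* trivial stand-in for the real server loop */
static void demo_serve(int fd) { (void)fd; }
```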
