Tags: sockets, posix, tcp, netstat

Is a FIN_WAIT2 state ever due to a close-connection initiator?


I'm doing a bit of POSIX socket programming and am running into a problem. I've written an application that uses non-blocking sockets. Because I'm currently developing a client against a server that is itself still in development, bad application-level messages occasionally get sent. When this happens, my only way to recover is to completely re-establish the connection to force the communication back to a known state.

I reset the connection by issuing a POSIX close() on the non-blocking socket. I then request a new socket and re-establish the connection.

However, one thing I've discovered is that every reset leaves the old connection in FIN_WAIT2. When running a netstat command, there are a ton of FIN_WAIT2 entries with no PID associated with them (I guess these are considered orphaned and are waiting on some kernel connection timeout?).

Anyway, I'm curious why all of these old connections are stacking up in the netstat output. I've done a bit of reading on TCP states, and it seems the FIN_WAIT2 I'm seeing means the server (i.e. not the close initiator, in my case) isn't responding with a FIN of its own to complete the close. Why is this?

Is a FIN_WAIT2 typically associated with a bug on the non-close-initiator side? Or is it possible that I'm doing something in my application that prevents the FIN message from being received? Does the fact that I'm using non-blocking sockets have anything to do with it?


Solution

  • In a word, yes, you probably have a bug on the non-close-initiator side. It has nothing to do with non-blocking sockets though. Non-blocking sockets only affect interactions between your application and its own operating system.

    It's important to understand that both sides must terminate a socket connection in order for the state to get properly cleaned up. It sounds like your server is not closing its end of the socket. One possible scenario:

    • Server creates listening socket, binds it, etc.
    • Server calls accept
    • Client calls connect, creating the connection (TCP state on both sides moves to ESTABLISHED)
    • send / recv / send / recv / etc (state still ESTABLISHED)
    • Client calls close; client OS sends FIN packet to server (client OS moves socket state to FIN_WAIT1)
    • Server OS sends ACK to acknowledge the client machine's FIN (server OS moves socket state to CLOSE_WAIT; client OS moves socket state to FIN_WAIT2)
    • Server (program) never closes its socket, and hence server OS never sends FIN, so client OS will maintain the socket in FIN_WAIT2 state. (Server socket state stays in CLOSE_WAIT)

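    The usual fix on the server side is to treat recv() returning 0 as the peer's FIN and close the socket in response. A minimal, self-contained sketch of that step, using socketpair() to stand in for an accepted TCP connection (so it runs without a network; Unix sockets don't literally have TCP states, as noted in the comments):

    ```c
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        /* socketpair stands in for an accepted connection; on a real
         * server, cfd would come from accept(). */
        int sv[2];
        socketpair(AF_UNIX, SOCK_STREAM, 0, sv);

        close(sv[0]);               /* "client" closes: on TCP this sends FIN */

        char buf[128];
        ssize_t n = recv(sv[1], buf, sizeof buf, 0);
        if (n == 0) {
            /* Peer has closed its end (on TCP we would now be in
             * CLOSE_WAIT). Closing here sends our FIN and lets both sides
             * finish the teardown; forgetting this close is the classic
             * cause of the peer sitting in FIN_WAIT2. */
            printf("peer closed, closing our end\n");
            close(sv[1]);
        }
        return 0;
    }
    ```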
    The client-side socket state could stay in FIN_WAIT2 state for a long time, or even forever, depending on the OS implementation. Linux, for example, has a tunable variable tcp_fin_timeout that specifies how long an otherwise idle connection will remain in FIN_WAIT2; but the TCP standard does not specify a time-out for FIN_WAIT2. (Note that the client program is not aware of any of this. It has closed the socket, the socket file descriptor has been destroyed and the socket is no longer accessible to it; this is all handled by the operating system.)
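    On Linux, the current value of that tunable can be read from procfs. A small sketch (Linux-specific; the /proc path is only present on Linux, so the code degrades gracefully elsewhere):

    ```c
    #include <stdio.h>

    int main(void)
    {
        /* Linux-specific: seconds an otherwise idle connection may sit in
         * FIN_WAIT2 before the kernel reclaims it. */
        FILE *f = fopen("/proc/sys/net/ipv4/tcp_fin_timeout", "r");
        if (!f) {
            puts("tcp_fin_timeout not available on this system");
            return 0;
        }
        int secs;
        if (fscanf(f, "%d", &secs) == 1)
            printf("FIN_WAIT2 timeout: %d seconds\n", secs);
        fclose(f);
        return 0;
    }
    ```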

    If that is what happened, you could try restarting the server program (because when you terminate the server process, the server's operating system will automatically close all its open files, and that will cause FINs to be sent on any still-open sockets). I think you will see that restarting the server causes all those client-side sockets to move into TIME_WAIT state, where they will stay for a short time before disappearing on their own. (There is a timeout mechanism specified for TIME_WAIT.)

    See also the TCP State Diagram.