Epoll zero recv() and negative(EAGAIN) send()

I was struggling with epoll last days and I'm in the middle of nowhere right now ;)

There's a lot of information on the Internet and obviously in the system man but I probably took an overdose and a bit confused.

In my server app(backend to nginx) I'm waiting for data from clients in the ET mode:

event_template.events = EPOLLIN | EPOLLRDHUP | EPOLLET

Everything has become curious when I have noticed that nginx is responding with 502 despite I could see successful send() on my side. I run wireshark to sniff and have realised that my server sends(trying and getting RST) data to another machine on the net. So, I decided that socket descriptor is invalid and this is sort of "undefined behaviour". Finally, I found out that on a second recv() I'm getting zero bytes which means that connection has to be closed and I'm not allowed to send data back anymore. Nevertheless, I was getting from epoll not just EPOLLIN but EPOLLRDHUP in a row.

Question: Do I have to close socket just for reading when recv() returns zero and shutdown(SHUT_WR) later on during EPOLLRDHUP processing?

Reading from socket in a nutshell:

    std::array<char, BatchSize> batch;
    ssize_t total_count = 0, count = 0;
    do {
        count = recv(_handle, batch.begin(), batch.size(), MSG_DONTWAIT);

        if (0 == count && 0 == total_count) {
            /// @??? Do I need to wait zero just on first iteration?
            close();
            return total_count;
        } else if (count < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK) {
                /// @??? Will be back with next EPOLLIN?!
                break ;
            }
            _last_error = errno;
            /// @brief just log the error               
            return 0;
        }

        if (count > 0) {
            total_count += count;
            /// DATA!
            if (count < batch.size()) {
                /// @??? Received less than requested - no sense to repeat recv, otherwise I need one more turn?! 
                return total_count;
            }
        }           
    } while (count > 0);

Probably, my the general mistake was attempt to send data on invalid socket descriptor and everything what happens later is just a consequence. But, I continued to dig ;) My second part of a question is about writing to a socket in MSG_DONTWAIT mode as well.

As far as I now know, send() may also return -1 and EAGAIN which means that I'm supposed to subscribe on EPOLLOUT and wait when kernel buffer will be free enough to receive some data from my me. Is this right? But what if client won't wait so long? Or, may I call blocking send(anyway, I'm sending on a different thread) and guarantee the everything what I send to kernel will be really sent to peer because of setsockopt(SO_LINGER)? And a final guess which I ask to confirm: I'm allowed to read and write simultaneously, but N>1 concurrent writes is a data race and everything that I have to deal with it is a mutex.

Thanks to everyone who at least read to the end :)

Solution

Questions: Do I have to close socket just for reading when recv() returns zero and shutdown(SHUT_WR) later on during EPOLLRDHUP processing?

No, there is no particular reason to perform that somewhat convoluted sequence of actions.

Having received a 0 return value from recv(), you know that the connection is at least half-closed at the network layer. You will not receive anything further from it, and I would not expect EPoll operating in edge-triggered mode to further advertise its readiness for reading, but that does not in itself require any particular action. If the write side remains open (from a local perspective) then you may continue to write() or send() on it, though you will be without a mechanism for confirming receipt of what you send.

What you actually should do depends on the application-level protocol or message exchange pattern you are assuming. If you expect the remote peer to shutdown the write side of its endpoint (connected to the read side of the local endpoint) while awaiting data from you then by all means do send the data it anticipates. Otherwise, you should probably just close the whole connection and stop using it when recv() signals end-of-file by returning 0. Note well that close()ing the descriptor will remove it automatically from any Epoll interest sets in which it is enrolled, but only if there are no other open file descriptors referring to the same open file description.

Any way around, until you do close() the socket, it remains valid, even if you cannot successfully communicate over it. Until then, there is no reason to expect that messages you attempt to send over it will go anywhere other than possibly to the original remote endpoint. Attempts to send may succeed, or they may appear to do even though the data never arrive at the far end, or the may fail with one of several different errors.

            /// @??? Do I need to wait zero just on first iteration?

You should take action on a return value of 0 whether any data have already been received or not. Not necessarily identical action, but either way you should arrange one way or another to get it out of the EPoll interest set, quite possibly by closing it.

                /// @??? Will be back with next EPOLLIN?!

If recv() fails with EAGAIN or EWOULDBLOCK then EPoll might very well signal read-readiness for it on a future call. Not necessarilly the very next one, though.

                /// @??? Received less than requested - no sense to repeat recv, otherwise I need one more turn?!

Receiving less than you requested is a possibility you should always be prepared for. It does not necessarily mean that another recv() won't return any data, and if you are using edge-triggered mode in EPoll then assuming the contrary is dangerous. In that case, you should continue to recv(), in non-blocking mode or with MSG_DONTWAIT, until the call fails with EAGAIN or EWOULDBLOCK.

As far as I now know, send() may also return -1 and EAGAIN which means that I'm supposed to subscribe on EPOLLOUT and wait when kernel buffer will be free enough to receive some data from my me. Is this right?

send() certainly can fail with EAGAIN or EWOULDBLOCK. It can also succeed, but send fewer bytes than you requested, which you should be prepared for. Either way, it would be reasonable to respond by subscribing to EPOLLOUT events on the file descriptor, so as to resume sending later.

But what if client won't wait so long?

That depends on what the client does in such a situation. If it closes the connection then a future attempt to send() to it would fail with a different error. If you were registered only for EPOLLOUT events on the descriptor then I suspect it would be possible, albeit unlikely, to get stuck in a condition where that attempt never happens because no further event is signaled. That likelihood could be reduced even further by registering for and correctly handling EPOLLRDHUP events, too, even though your main interest is in writing.

If the client gives up without ever closing the connection then EPOLLRDHUP probably would not be useful, and you're more likely to get the stale connection stuck indefinitely in your EPoll. It might be worthwhile to address this possibility with a per-FD timeout.

Or, may I call blocking send(anyway, I'm sending on a different thread) and guarantee the everything what I send to kernel will be really sent to peer because of setsockopt(SO_LINGER)?

If you have a separate thread dedicated entirely to sending on that specific file descriptor then you can certainly consider blocking send()s. The only drawback is that you cannot implement a timeout on top of that, but other than that, what would such a thread do if it blocking either on sending data or on receiving more data to send?

I don't see quite what SO_LINGER has to do with it, though, at least on the local side. The kernel will make every attempt to send data that you have already dispatched via a send() call to the remote peer, even if you close() the socket while data are still buffered, regardless of the value of SO_LINGER. The purpose of that option is to receive (and drop) straggling data associated with the connection after it is closed, so that they are not accidentally delivered to another socket.

None of this can guarantee that the data are successfully delivered to the remote peer, however. Nothing can guarantee that.

And a final guess which I ask to confirm: I'm allowed to read and write simultaneously, but N>1 concurrent writes is a data race and everything that I have to deal with it is a mutex.

Sockets are full-duplex, yes. Moreover, POSIX requires most functions, including send() and recv(), to be thread safe. Nevertheless, multiple threads writing to the same socket is asking for trouble, for the thread safety of individual calls does not guarantee coherency across multiple calls.