Search code examples
javatcptimeoutnio

Non-Blocking NIO SocketChannel causes IOException timeout on read?


I'm still learning NIO and came up with a simple Ping/Pong application. My client is written in NIO. The server is the telnet-endpoint of Memcached. The client is constantly issuing stats\r\n and the server responds with some statistics.

This setup is working in a non-blocking way with SocketChannel and Selector.select()

The part that I do not fully understand is the error case where the connection between client and server get in limbo state or the server went away without closing the connection (During testing I simulate that with a iptables DROP rule):

I end up (on OSX and Linux) with the following stacktrace on the read operation after a successful select() operation:

java.io.IOException: Operation timed out
  at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[na:1.8.0_73]
  at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[na:1.8.0_73]
  at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[na:1.8.0_73]
  at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[na:1.8.0_73]
  at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) ~[na:1.8.0_73]

The code in use looks something like this:

    Selector selector = Selector.open();
    SocketChannel memcachedClient = SocketChannel.open(MEMCACHED_HOST);
    memcachedClient.configureBlocking(false);

    int ops = memcachedClient.validOps();
    memcachedClient.register(selector, ops, null);

    while (true) {
        LOGGER.info("Masterloop - Start");
        //selector.select();
        selector.select(1 * 1000);
        LOGGER.debug("Masterloop - After select");

        Set<SelectionKey> selectedKeys = selector.selectedKeys();
        Iterator<SelectionKey> selectedKeysIterator = selectedKeys.iterator();
        LOGGER.debug("Found Selectors {}", selectedKeys.size());

        while (selectedKeysIterator.hasNext()) {
            SelectionKey key = selectedKeysIterator.next();

            LOGGER.info("Key: w={} r={} v={} - {}",
                    new Object[] { key.isWritable(), key.isReadable(), key.isValid(), key.toString() });

            if (key.isReadable()) {
                read(key);
            } else if (key.isWritable()) {
                write(key);
            }

            selectedKeysIterator.remove();
        }
    }

What is strange to me are the two observed behaviors depending on the usage of selector.select() and selector.select(timeout):

Selector.select() blocks and fails

Without a timeout, the Selector.select() call blocks for ~50 seconds after I let all packages DROP (or unplug the network cable) and comes back with a single key:

2016-07-12 11:51:55 INFO  App:88 - Masterloop - Start
2016-07-12 11:52:49 DEBUG App:92 - Masterloop - After select
2016-07-12 11:52:49 DEBUG App:96 - Found Selectors 1
2016-07-12 11:52:49 INFO  App:101 - Key: w=false r=true v=true - sun.nio.ch.SelectionKeyImpl@2e817b38

and I end up with the stacktrace above in my read() method:

private static void read(SelectionKey key) throws IOException {
    SocketChannel socketChannel = (SocketChannel) key.channel();

    ByteBuffer readBuffer = ByteBuffer.allocate(8192);

    int numRead;
    try {
        numRead = socketChannel.read(readBuffer); // Exception is thrown here
    } catch (IOException e) {
        LOGGER.warn(e.getMessage(), e);
        key.cancel();
        socketChannel.close();
        return;
    }
// do something with readBuffer
key.interestOps(SelectionKey.OP_WRITE);
}

Selector.select(timeout) returns but throws an IOException later, too

With an timeout (i.e. 1000 ms), the Selector.select(1000) call returns after I let all packages DROP (or unplug the network cable) and with zero keys. That happens constantly as my while-loop continues to run. Again after ~45 seconds the then-active loop iteration returns a single, readable key that ends again with an IOException on read()

2016-07-12 11:54:24 INFO  App:113 - ================
2016-07-12 11:54:24 INFO  App:88 - Masterloop - Start
2016-07-12 11:54:25 DEBUG App:92 - Masterloop - After select
2016-07-12 11:54:25 DEBUG App:96 - Found Selectors 0
2016-07-12 11:54:25 INFO  App:113 - ================
2016-07-12 11:54:25 INFO  App:88 - Masterloop - Start
2016-07-12 11:54:26 DEBUG App:92 - Masterloop - After select
2016-07-12 11:54:26 DEBUG App:96 - Found Selectors 0
2016-07-12 11:54:26 INFO  App:113 - ================
...
2016-07-12 11:55:16 INFO  App:113 - ================
2016-07-12 11:55:16 INFO  App:88 - Masterloop - Start
2016-07-12 11:55:17 DEBUG App:92 - Masterloop - After select
2016-07-12 11:55:17 DEBUG App:96 - Found Selectors 0
2016-07-12 11:55:17 INFO  App:113 - ================
2016-07-12 11:55:17 INFO  App:88 - Masterloop - Start
2016-07-12 11:55:18 DEBUG App:92 - Masterloop - After select
2016-07-12 11:55:18 DEBUG App:96 - Found Selectors 1
2016-07-12 11:55:18 INFO  App:101 - Key: w=false r=true v=true - sun.nio.ch.SelectionKeyImpl@2e817b38
2016-07-12 11:55:18 WARN  App:30 - Operation timed out
java.io.IOException: Operation timed out
  at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[na:1.8.0_73]
  at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[na:1.8.0_73]
  at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[na:1.8.0_73]
  at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[na:1.8.0_73]
  at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) ~[na:1.8.0_73]

Question

I somewhat understand the behavior in general: It is impossible to judge if the server went away or the connection is still active without actively probing it.

Looking at Wireshark I see a bunch of resubmission (which makes sense as the server never ACKed my client's request) and after the 8th resubmission the IOException happens. But how can I influence this timeout? Is that only on the TCP level? If so which settings (kernel?) are these? I played with some settings but they seem to work only together with KEEP_ALIVE. Can't I get a feedback from NIO earlier (say 10 seconds) that something went wrong?

Thanks!


Solution

    1. The operation that has timed out is a prior send. This is delivered to the read() immediately, not after a timeout.
    2. You aren't distinguishing between select success (returns > 0) and select timeout (returns zero).
    3. When you get (1), the select will have returned > 0 with the affected channel marked as readable: however what is there to be read is not data but the error. (It can also be end of stream, if the peer has disconnected.)