I am trying to understand a warning: every time I run my Spark job, I see the exception below. I see it on 2 of the 3 nodes in my cluster. But as I said, it is just a WARN; the job succeeds regardless.
com.datastax.driver.core.exceptions.ConnectionException: [x.x.x.x/x.x.x.x:9042] Pool was closed during initialization
CASSANDRA LOG
INFO [SharedPool-Worker-1] 2017-07-17 22:25:48,716 Message.java:605 - Unexpected exception during request; channel = [id: 0xf0ee1096, /x.x.x.x:54863 => /x.x.x.x:9042]
io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
    at io.netty.channel.unix.Errors.newIOException(Errors.java:105) ~[netty-all-4.0.34.Final.jar:4.0.34.Final]
    at io.netty.channel.unix.Errors.ioResult(Errors.java:121) ~[netty-all-4.0.34.Final.jar:4.0.34.Final]
    at io.netty.channel.unix.FileDescriptor.readAddress(FileDescriptor.java:134) ~[netty-all-4.0.34.Final.jar:4.0.34.Final]
    at io.netty.channel.epoll.AbstractEpollChannel.doReadBytes(AbstractEpollChannel.java:239) ~[netty-all-4.0.34.Final.jar:4.0.34.Final]
    at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:822) ~[netty-all-4.0.34.Final.jar:4.0.34.Final]
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:348) ~[netty-all-4.0.34.Final.jar:4.0.34.Final]
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:264) ~[netty-all-4.0.34.Final.jar:4.0.34.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112) ~[netty-all-4.0.34.Final.jar:4.0.34.Final]
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) ~[netty-all-4.0.34.Final.jar:4.0.34.Final]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
The core of the error is "Connection timed out". I recommend troubleshooting network connectivity to the Cassandra cluster, starting with simpler tools such as ping, telnet, and nc (see the sketch after this list). Some potential causes:

- A firewall between the Spark nodes and the Cassandra nodes is dropping packets or silently closing idle connections.
- A routing or network hardware problem affects the path to some hosts but not others.
- A Cassandra node is overloaded (or pausing for garbage collection) and cannot respond before the client times out.
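If you want to script that check and run it from each Spark node, here is a minimal Java sketch of the same TCP-level test that nc -vz <host> 9042 performs; the class name and the 5-second timeout are just placeholders, not anything from your setup:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Minimal TCP reachability check, roughly equivalent to `nc -vz <host> 9042`.
// Run it from each Spark node against each Cassandra node.
public class PortCheck {
    public static void main(String[] args) {
        int port = 9042; // native transport port, as seen in your log
        for (String host : args) {
            try (Socket socket = new Socket()) {
                // Attempt the connection with a 5-second timeout.
                socket.connect(new InetSocketAddress(host, port), 5000);
                System.out.println(host + ":" + port + " reachable");
            } catch (IOException e) {
                System.out.println(host + ":" + port + " UNREACHABLE: " + e.getMessage());
            }
        }
    }
}
```

Running it as java PortCheck x.x.x.x y.y.y.y z.z.z.z from each Spark node should quickly show whether only specific host pairs time out.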
You mentioned that the problem is intermittent ("seeing this in 2 nodes of my 3 node cluster") and does not cause the job to fail. This could indicate that one of the problems listed above is affecting only a subset of the nodes in the cluster. (If connectivity to all nodes were broken, the job would likely have failed.)
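If the checks do reveal a flaky path (for example, a firewall silently closing idle connections), raising the driver's socket timeouts and enabling TCP keepalive can reduce this kind of noise. A minimal sketch against the DataStax Java driver 3.x (the same com.datastax.driver.core package as in your exception); the values shown are illustrative, not recommendations:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SocketOptions;

public class TunedConnect {
    public static void main(String[] args) {
        SocketOptions socketOptions = new SocketOptions()
                .setConnectTimeoutMillis(10_000)  // driver default is 5000 ms
                .setReadTimeoutMillis(20_000)     // driver default is 12000 ms
                .setKeepAlive(true);              // let the OS probe idle connections

        try (Cluster cluster = Cluster.builder()
                .addContactPoint("x.x.x.x")       // one of your Cassandra nodes
                .withSocketOptions(socketOptions)
                .build();
             Session session = cluster.connect()) {
            System.out.println("Connected to " + cluster.getClusterName());
        }
    }
}
```

If you connect through the Spark Cassandra Connector rather than the driver directly, I believe the analogous knobs are exposed as spark.cassandra.* configuration properties (e.g. spark.cassandra.connection.timeout_ms), so check your connector version's documentation for the exact names.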