
Node getting disconnected in a 2-server grid in Apache Ignite


I have a 2-server grid in Apache Ignite. While the database is being loaded into the cache, one of the nodes gets disconnected with the error below. I have tried setting FailureDetectionTimeout and NetworkTimeout to very large values (2147483647, i.e. Integer.MAX_VALUE), and I have also tried JVM tuning on both nodes as described in the JVM Tuning post, but I still get the same error:

[16:30:31,244][SEVERE][pub-#96%null%][DataStreamProcessor] Failed to respond to node [nodeId=797bf03b-3baf-4724-8eca-ccccec64605c, res=DataStreamerResponse [reqId=34834, forceLocDep=true]]
class org.apache.ignite.IgniteCheckedException: Failed to send message (node may have left the grid or TCP connection cannot be established due to firewall issues) [node=TcpDiscoveryNode [id=797bf03b-3baf-4724-8eca-ccccec64605c, addrs=[0:0:0:0:0:0:0:1%lo, 10.0.42.1, 127.0.0.1, 192.168.140.52], sockAddrs=[01hw146471/192.168.140.52:47500, /10.0.42.1:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=2, intOrder=2, lastExchangeTime=1478687030160, loc=false, ver=1.7.0#20160801-sha1:383273e3, isClient=false], topic=T1 [topic=TOPIC_DATASTREAM, id=803ded84851-797bf03b-3baf-4724-8eca-ccccec64605c], msg=DataStreamerResponse [reqId=34834, forceLocDep=true], policy=0]
	at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1309)
	at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1361)
	at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1331)
	at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor.sendResponse(DataStreamProcessor.java:348)
	at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor.processRequest(DataStreamProcessor.java:313)
	at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor.access$000(DataStreamProcessor.java:50)
	at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor$1.onMessage(DataStreamProcessor.java:80)
	at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1238)
	at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:866)
	at org.apache.ignite.internal.managers.communication.GridIoManager.access$1700(GridIoManager.java:106)
	at org.apache.ignite.internal.managers.communication.GridIoManager$5.run(GridIoManager.java:829)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: class org.apache.ignite.spi.IgniteSpiException: Failed to send message to remote node: TcpDiscoveryNode [id=797bf03b-3baf-4724-8eca-ccccec64605c, addrs=[0:0:0:0:0:0:0:1%lo, 10.0.42.1, 127.0.0.1, 192.168.140.52], sockAddrs=[01hw146471/192.168.140.52:47500, /10.0.42.1:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=2, intOrder=2, lastExchangeTime=1478687030160, loc=false, ver=1.7.0#20160801-sha1:383273e3, isClient=false]
	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:1996)
	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:1936)
	at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1304)
	... 13 more
Caused by: class org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node still alive?). Make sure that each ComputeTask and cache Transaction has a timeout set in order to prevent parties from waiting forever in case of network issues [nodeId=797bf03b-3baf-4724-8eca-ccccec64605c, addrs=[01hw146471/192.168.140.52:47100, /10.0.42.1:47100, /0:0:0:0:0:0:0:1%lo:47100, /127.0.0.1:47100]]
	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2499)
	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2140)
	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2034)
	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:1970)
	... 15 more
	Suppressed: class org.apache.ignite.IgniteCheckedException: Failed to connect to address: 01hw146471/192.168.140.52:47100
		at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2504)
		... 18 more
	Caused by: class org.apache.ignite.IgniteCheckedException: Failed to read remote node recovery handshake (connection closed).
		at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.safeHandshake(TcpCommunicationSpi.java:2709)
		at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2371)
		... 18 more
	Suppressed: class org.apache.ignite.IgniteCheckedException: Failed to connect to address: /10.0.42.1:47100
		at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2504)
		... 18 more
	Caused by: java.net.ConnectException: Connection refused
		at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
		at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
		at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:111)
		at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2363)
		... 18 more
	Suppressed: class org.apache.ignite.IgniteCheckedException: Failed to connect to address: /0:0:0:0:0:0:0:1%lo:47100
		at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2504)
		... 18 more
	Caused by: class org.apache.ignite.IgniteCheckedException: Remote node ID is not as expected [expected=797bf03b-3baf-4724-8eca-ccccec64605c, rcvd=54ac75f7-7b87-4502-ba8c-1e3a82e87be3]
		at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.safeHandshake(TcpCommunicationSpi.java:2614)
		at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2371)
		... 18 more
	Suppressed: class org.apache.ignite.IgniteCheckedException: Failed to connect to address: /127.0.0.1:47100
		at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2504)
		... 18 more
	Caused by: class org.apache.ignite.IgniteCheckedException: Remote node ID is not as expected [expected=797bf03b-3baf-4724-8eca-ccccec64605c, rcvd=54ac75f7-7b87-4502-ba8c-1e3a82e87be3]
		at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.safeHandshake(TcpCommunicationSpi.java:2614)
		at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2371)
		... 18 more

[16:30:31] Topology snapshot [ver=7, servers=1, clients=0, CPUs=48, heap=50.0GB]
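For scale: Ignite timeouts such as FailureDetectionTimeout and NetworkTimeout are given in milliseconds, so the 2147483647 tried above is almost 25 days. Raising them that far would mostly delay how long the cluster waits before dropping an unresponsive peer; it would not make a crashed, stalled or unreachable node answer, and the "Connection refused" entries in the log are not timeouts at all. A minimal stdlib-only conversion (the class and method names are illustrative, not part of Ignite):

```java
import java.time.Duration;

public class TimeoutScale {
    // Formats a non-negative millisecond timeout (the unit Ignite uses for
    // failureDetectionTimeout and networkTimeout) as d/h/m/s/ms components.
    static String describe(long millis) {
        Duration d = Duration.ofMillis(millis);
        return String.format("%dd %dh %dm %ds %dms",
                d.toDays(), d.toHours() % 24, d.toMinutes() % 60,
                d.getSeconds() % 60, millis % 1000);
    }

    public static void main(String[] args) {
        // 2147483647 == Integer.MAX_VALUE, the value tried in the question.
        System.out.println(describe(Integer.MAX_VALUE)); // prints "24d 20h 31m 23s 647ms"
    }
}
```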


Solution

  • This message usually means that the destination node is already dead or unresponsive. Make sure that:

    • Both nodes have enough heap, do not run out of memory, and do not suffer from long GC pauses.
    • The network is stable and both nodes can connect to each other in both directions (the log shows discovery on port 47500 and communication on port 47100).
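To check the heap and GC point, the standard java.lang.management beans are one place to look. The sketch below (class and method names are illustrative, not Ignite API) reports on the JVM it runs in, so it would have to be called from inside each Ignite node's process, or the same beans read remotely over JMX. Cumulative GC time only hints at GC pressure; individual long pauses are easier to spot in GC logs (-XX:+PrintGCDetails on JDK 8, -Xlog:gc on JDK 9+).

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class HeapGcReport {
    // Sums collection count and cumulative collection time (ms) over all
    // collectors of this JVM. A collector reports -1 when a value is undefined.
    static long[] gcTotals() {
        long count = 0;
        long timeMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            count += Math.max(0, gc.getCollectionCount());
            timeMs += Math.max(0, gc.getCollectionTime());
        }
        return new long[] {count, timeMs};
    }

    // Heap usage of this JVM in MB: {used, max}.
    static long[] heapMb() {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024L * 1024L;
        return new long[] {(rt.totalMemory() - rt.freeMemory()) / mb, rt.maxMemory() / mb};
    }

    public static void main(String[] args) {
        long[] heap = heapMb();
        long[] gc = gcTotals();
        System.out.printf("heap used=%d MB of max=%d MB; GC runs=%d, total GC time=%d ms%n",
                heap[0], heap[1], gc[0], gc[1]);
    }
}
```

Sampling gcTotals() periodically and comparing the growth in GC time with the wall-clock interval gives a rough share of time spent in GC during the data load.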
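For the network point, a plain TCP probe run on each host against the other one is a quick first check. The sketch assumes the ports seen in the log (47500 for discovery, 47100 for communication) and uses 192.168.140.52 from the log only as a default placeholder; the class name is illustrative. A successful connect only shows that the port is open and not firewalled; it does not prove that the Ignite handshake succeeds.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortCheck {
    // True if a TCP connection to host:port is established within timeoutMs.
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Refused, timed out, unreachable or unresolvable host.
            return false;
        }
    }

    public static void main(String[] args) {
        // Placeholder peer from the log; pass the other node's address as an argument.
        String peer = args.length > 0 ? args[0] : "192.168.140.52";
        for (int port : new int[] {47500, 47100}) { // discovery, communication
            System.out.println(peer + ":" + port + " reachable=" + canConnect(peer, port, 3000));
        }
    }
}
```

Run it on each host with the other host's address, e.g. `java PortCheck 192.168.140.52`, and again in the reverse direction.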