
Ignite node not able to join the cluster, waiting for coordinator response indefinitely


I am running two server nodes (A and B) of Apache Ignite 2.7.0 with TcpDiscoveryJdbcIpFinder for discovery.
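For context, the discovery part of the configuration looks roughly like this (simplified sketch; ipFinderDataSource is a placeholder for the shared JDBC DataSource bean that both nodes point at):

 <bean class="org.apache.ignite.configuration.IgniteConfiguration">
     <property name="discoverySpi">
         <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
             <property name="ipFinder">
                 <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.jdbc.TcpDiscoveryJdbcIpFinder">
                     <!-- shared DataSource both nodes use to register/look up addresses;
                          the bean name is a placeholder for this sketch -->
                     <property name="dataSource" ref="ipFinderDataSource"/>
                 </bean>
             </property>
         </bean>
     </property>
 </bean>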

When I start B as the first node and then start node A, everything works fine.

But when I start A as the first node and then start B, node B gets stuck indefinitely, trying to join the cluster.

When I checked the logs, I found that node B had joined the cluster and the partition exchange had started.

  2019-09-05 10:59:51,850 | disco-event-worker-#39         | INFO  | org.apache.ignite.internal.managers.discovery.GridDiscoveryManager               | 
Added new node to topology: TcpDiscoveryNode [id=686bdf14-201c-43f3-8617-05c7e51224ea, addrs=[10.49.95.44], sockAddrs=[some2.domain/10.49.95.44:47500], discPort=47500, order=2, intOrder=2, lastExchangeTime=1567673970663, loc=false, ver=2.7.0#20181130-sha1:256ae401, isClient=false] |  
2019-09-05 10:59:51,850 | disco-event-worker-#39         | DEBUG | org.apache.ignite.internal.managers.discovery.GridDiscoveryManager               |
>>> +----------------+
>>> Topology snapshot.
>>> +----------------+
>>> Ignite instance name: default
>>> Number of server nodes: 2
>>> Number of client nodes: 0
>>> Topology version: 2
>>> Local: F5DBEC80-D22F-4977-A534-A0E9425A77BB, [some.domain/10.49.94.205], 1, Windows Server 2012 R2 amd64 6.3, admBruegel, Java(TM) SE Runtime Environment 1.8.0_202-b08
>>> Remote: 686BDF14-201C-43F3-8617-05C7E51224EA, [some2.domain/10.49.95.44], 2, Windows Server 2016 amd64 10.0, admBruegel, Java(TM) SE Runtime Environment 1.8.0_202-b08
>>> Total number of CPUs: 4
>>> Total heap size: 32.0GB
>>> Total offheap size: 4.9GB

After a while, node A receives a NODE_FAILED event for node B, even though node B is still running and waiting for node A to complete the join process.

2019-09-05 10:59:51,881 | disco-event-worker-#39         | WARN  | org.apache.ignite.internal.managers.discovery.GridDiscoveryManager               | 
Node FAILED: TcpDiscoveryNode [id=686bdf14-201c-43f3-8617-05c7e51224ea, addrs=[10.49.95.44], sockAddrs=[some2.domain/10.49.95.44:47500], discPort=47500, order=2, intOrder=2, lastExchangeTime=1567673970663, loc=false, ver=2.7.0#20181130-sha1:256ae401, isClient=false] |  
2019-09-05 10:59:51,881 | disco-event-worker-#39         | DEBUG | org.apache.ignite.internal.managers.discovery.GridDiscoveryManager               |
>>> +----------------+
>>> Topology snapshot.
>>> +----------------+
>>> Ignite instance name: default
>>> Number of server nodes: 1
>>> Number of client nodes: 0
>>> Topology version: 3
>>> Local: F5DBEC80-D22F-4977-A534-A0E9425A77BB, [some.domain/10.49.94.205], 1, Windows Server 2012 R2 amd64 6.3, admBruegel, Java(TM) SE Runtime Environment 1.8.0_202-b08
>>> Total number of CPUs: 2
>>> Total heap size: 16.0GB
>>> Total offheap size: 2.5GB
2019-09-05 10:59:51,881 | disco-net-seg-chk-worker-#38   | DEBUG | org.apache.ignite.internal.managers.discovery.GridDiscoveryManager               
| Segment has been checked [requested=true, valid=true] |  
2019-09-05 10:59:51,881 | disco-event-worker-#39         | DEBUG | org.apache.ignite.internal.managers.deployment.GridDeploymentPerVersionStore     
| Processing node departure event: DiscoveryEvent [evtNode=TcpDiscoveryNode [id=686bdf14-201c-43f3-8617-05c7e51224ea, addrs=[10.49.95.44], sockAddrs=[some2.domain/10.49.95.44:47500], discPort=47500, order=2, intOrder=2, lastExchangeTime=1567673970663, loc=false, ver=2.7.0#20181130-sha1:256ae401, isClient=false], topVer=3, nodeId8=f5dbec80, msg=Node failed: TcpDiscoveryNode [id=686bdf14-201c-43f3-8617-05c7e51224ea, addrs=[10.49.95.44], sockAddrs=[some2.domain/10.49.95.44:47500], discPort=47500, order=2, intOrder=2, lastExchangeTime=1567673970663, loc=false, ver=2.7.0#20181130-sha1:256ae401, isClient=false], type=NODE_FAILED, tstamp=1567673991881] |  
2019-09-05 10:59:51,881 | disco-event-worker-#39         | DEBUG | org.apache.ignite.internal.processors.cache.GridCacheMvccManager                 
| Processing node left [nodeId=686bdf14-201c-43f3-8617-05c7e51224ea] |  
2019-09-05 10:59:51,897 | disco-event-worker-#39         | DEBUG | org.apache.ignite.internal.processors.cache.GridCacheDeploymentManager           
| Processing node departure: 686bdf14-201c-43f3-8617-05c7e51224ea |  
2019-09-05 10:59:51,897 | disco-event-worker-#39         | DEBUG | org.apache.ignite.internal.managers.deployment.GridDeploymentLocalStore          
| Deployment meta for local deployment: GridDeploymentMetadata [depMode=SHARED, alias=org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager$$Lambda$153/1190953783, clsName=org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager, userVer=null, sndNodeId=f5dbec80-d22f-4977-a534-a0e9425a77bb, clsLdrId=null, clsLdr=null, participants=null, parentLdr=null, record=true, nodeFilter=null, seqNum=n/a] |  
2019-09-05 10:59:51,897 | disco-event-worker-#39         | DEBUG | org.apache.ignite.spi.deployment.local.LocalDeploymentSpi                        
| Registering [ldrRsrcs={ParallelWebappClassLoader

Node B keeps logging "Join request message has been sent (waiting for coordinator response)" and waits forever.

I increased the networkTimeout and failureDetectionTimeout of the IgniteConfiguration to

 <property name="failureDetectionTimeout" value="120000"/>
 <property name="networkTimeout" value="120000"/> 

and the networkTimeout and joinTimeout of the discovery SPI to

 <property name="networkTimeout" value="120000"/>
  <property name="joinTimeout" value="90000"/>

Still, the issue persists.

Both nodes can ping each other and there is no firewall between them, so no ports are blocked. These are the logs of both nodes.

I am not able to figure out why this is happening. The same setup runs fine on other servers.


Solution

  • It's possible that Node A can talk to Node B via discovery port (47500) but not communication port (47100).

    It's also possible that something on either node slows down the initial exchange. For example, if node B can't resolve one of node A's addresses, the initial exchange may stall (check your DNS settings, etc.). A sketch of the relevant settings for both possibilities is below.
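
    To rule out those two causes, you can pin the address each node advertises and make the communication port explicit. This is a minimal sketch, assuming the default Spring XML configuration; the localHost value is node A's address taken from the question's log and is only illustrative (set each node's own address):

     <bean class="org.apache.ignite.configuration.IgniteConfiguration">
         <!-- Pin the node to one resolvable address so peers don't try
              to connect to unreachable or unresolvable interfaces -->
         <property name="localHost" value="10.49.94.205"/>

         <property name="communicationSpi">
             <bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
                 <!-- communication port used for partition exchange; 47100 is the default -->
                 <property name="localPort" value="47100"/>
             </bean>
         </property>
     </bean>

    With the port fixed, you can verify from each host that the other node's 47100 (communication) and 47500 (discovery) ports are actually reachable, not just that the host answers ping.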