We have an Ignite cluster with 3 nodes, and all services connect to the cluster using the Java thin client.
When one of the server nodes goes down and the services try to connect, some connect successfully and others fail with an "Ignite cluster unavailable" error. We debugged the source and found that, during construction of the ReliableChannel object, it selects a random node to connect to, and if that node is not available, it throws a client connection exception.
Ideally, we would expect it to fall back to the other nodes, since they are still available in the cluster. We see that this fallback logic is implemented in the service method of the ReliableChannel class.
Is there any specific reason for not implementing fallback during object construction and only having it in the service method? Are there any options to connect to the other nodes?
Also, is there any way we can control the order in which nodes are connected?
ReliableChannel code snippet:
    ReliableChannel(
        Function<ClientChannelConfiguration, Result<ClientChannel>> chFactory,
        ClientConfiguration clientCfg
    ) throws ClientException {
        if (chFactory == null)
            throw new NullPointerException("chFactory");

        if (clientCfg == null)
            throw new NullPointerException("clientCfg");

        this.chFactory = chFactory;
        this.clientCfg = clientCfg;

        List<InetSocketAddress> addrs = parseAddresses(clientCfg.getAddresses());

        primary = addrs.get(new Random().nextInt(addrs.size())); // we already verified there is at least one address

        ch = chFactory.apply(new ClientChannelConfiguration(clientCfg).setAddress(primary)).get();

        for (InetSocketAddress a : addrs)
            if (a != primary)
                this.backups.add(a);
    }

    public <T> T service(
        ClientOperation op,
        Consumer<BinaryOutputStream> payloadWriter,
        Function<BinaryInputStream, T> payloadReader
    ) throws ClientException {
        ClientConnectionException failure = null;

        T res = null;

        int totalSrvs = 1 + backups.size();

        svcLock.lock();
        try {
            for (int i = 0; i < totalSrvs; i++) {
                try {
                    if (failure != null)
                        changeServer();

                    if (ch == null)
                        ch = chFactory.apply(new ClientChannelConfiguration(clientCfg).setAddress(primary)).get();

                    long id = ch.send(op, payloadWriter);

                    res = ch.receive(op, id, payloadReader);

                    failure = null;

                    break;
                }
                catch (ClientConnectionException e) {
                    if (failure == null)
                        failure = e;
                    else
                        failure.addSuppressed(e);
                }
            }
        }
        finally {
            svcLock.unlock();
        }

        if (failure != null)
            throw failure;

        return res;
    }
This will be fixed in Apache Ignite 2.8; see IGNITE-11599.
It may already be fixed in GridGain, which backports such fixes.
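
Until you are on a version with the fix, one possible workaround is to do the fallback yourself at connection time: try the addresses one at a time, in the order you choose, and keep the first connection that succeeds. The sketch below is only an illustration of that idea, not Ignite code. The class name FallbackConnect, the method connectInOrder and the connector parameter are hypothetical names made up for this example; the connector stands in for whatever call actually opens the thin client connection.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical helper, not part of Apache Ignite.
final class FallbackConnect {
    private FallbackConnect() {
        // Static helper only.
    }

    /**
     * Tries each address in list order and returns the first connection
     * that the connector manages to open.
     *
     * If every address fails, the first failure is rethrown with the later
     * ones attached as suppressed exceptions, mirroring how the service
     * method quoted in the question collects failures.
     */
    static <C> C connectInOrder(List<String> addrs, Function<String, C> connector) {
        if (addrs == null || addrs.isEmpty())
            throw new IllegalArgumentException("addrs must contain at least one address");

        RuntimeException failure = null;

        for (String addr : addrs) {
            try {
                // Assumption: the connector throws an unchecked exception
                // when the node at this address cannot be reached.
                return connector.apply(addr);
            }
            catch (RuntimeException e) {
                if (failure == null)
                    failure = e;
                else
                    failure.addSuppressed(e);
            }
        }

        throw failure;
    }
}
```

With the thin client, the connector could be a lambda that calls Ignition.startClient(new ClientConfiguration().setAddresses(addr)) for a single address, wrapping the exception in an unchecked one if it is checked in your Ignite version. Because each attempt passes exactly one address, the random primary selection in the constructor has nothing to choose from, and the list order becomes the connection order. Note that a client opened this way knows only that one address, so failover after the initial connection would also have to be handled by the caller.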