Ignite Java Thin Client - Connection fails when one node is down


We have an Ignite cluster with 3 nodes, and all services connect to it using the Java thin client.

When one of the server nodes goes down and the services try to connect, some connect successfully and others fail with an Ignite cluster unavailable error. We debugged the source and found that, during construction, ReliableChannel selects a random node to connect to, and if that node is not available, it throws a client connection exception.

Ideally, we would expect it to fall back to the other nodes, since they are still available in the cluster. We see that this fallback logic is implemented in the service method of the ReliableChannel class.

Is there a specific reason why fallback is implemented only in the service method and not during object construction? Are there any options for connecting to the other nodes?

Also, is there any way we can control the order in which nodes are connected?

ReliableChannel code snippet:

ReliableChannel(
        Function<ClientChannelConfiguration, Result<ClientChannel>> chFactory,
        ClientConfiguration clientCfg
    ) throws ClientException {
        if (chFactory == null)
            throw new NullPointerException("chFactory");

        if (clientCfg == null)
            throw new NullPointerException("clientCfg");

        this.chFactory = chFactory;
        this.clientCfg = clientCfg;

        List<InetSocketAddress> addrs = parseAddresses(clientCfg.getAddresses());

        primary = addrs.get(new Random().nextInt(addrs.size())); // we already verified there is at least one address

        ch = chFactory.apply(new ClientChannelConfiguration(clientCfg).setAddress(primary)).get();

        for (InetSocketAddress a : addrs)
            if (a != primary)
                this.backups.add(a);
    }


    public <T> T service(
        ClientOperation op,
        Consumer<BinaryOutputStream> payloadWriter,
        Function<BinaryInputStream, T> payloadReader
    ) throws ClientException {
        ClientConnectionException failure = null;

        T res = null;

        int totalSrvs = 1 + backups.size();

        svcLock.lock();
        try {
            for (int i = 0; i < totalSrvs; i++) {
                try {
                    if (failure != null)
                        changeServer();

                    if (ch == null)
                        ch = chFactory.apply(new ClientChannelConfiguration(clientCfg).setAddress(primary)).get();

                    long id = ch.send(op, payloadWriter);

                    res = ch.receive(op, id, payloadReader);

                    failure = null;

                    break;
                }
                catch (ClientConnectionException e) {
                    if (failure == null)
                        failure = e;
                    else
                        failure.addSuppressed(e);
                }
            }
        }
        finally {
            svcLock.unlock();
        }

        if (failure != null)
            throw failure;

        return res;
    }

Solution

  • This will be fixed in Apache Ignite 2.8; see IGNITE-11599.

    It may already be fixed in GridGain, which backports such fixes.
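
    Until a fixed version is available, one possible workaround is to do the fallback in application code: try one address at a time, in an order you choose, and keep the first connection that succeeds. The sketch below shows only that loop and uses no Ignite classes. FallbackConnect, Connector and connectWithFallback are hypothetical names made up for this illustration. In a real service the connector would presumably be something like addr -> Ignition.startClient(new ClientConfiguration().setAddresses(addr)), which is an assumption about how you would wire it in, not something taken from the Ignite sources.

```java
import java.util.Arrays;
import java.util.List;

public class FallbackConnect {
    /** Hypothetical callback that opens a connection to a single address. */
    @FunctionalInterface
    public interface Connector<C> {
        C connect(String addr) throws Exception;
    }

    /**
     * Tries the addresses in list order and returns the first connection
     * that succeeds. If every address fails, throws IllegalStateException
     * with the first failure as the cause and later failures as suppressed.
     */
    public static <C> C connectWithFallback(List<String> addrs, Connector<C> connector) {
        if (addrs == null || addrs.isEmpty())
            throw new IllegalArgumentException("addrs must not be empty");

        Exception first = null;

        for (String addr : addrs) {
            try {
                return connector.connect(addr);
            }
            catch (Exception e) {
                // Remember the failure and move on to the next address.
                if (first == null)
                    first = e;
                else
                    first.addSuppressed(e);
            }
        }

        throw new IllegalStateException("All addresses failed: " + addrs, first);
    }

    public static void main(String[] args) {
        // The list order is the order in which nodes are tried.
        List<String> addrs = Arrays.asList("node1:10800", "node2:10800", "node3:10800");

        // Stand-in connector for the demo: pretends node1 is down.
        String conn = connectWithFallback(addrs, addr -> {
            if (addr.startsWith("node1"))
                throw new IllegalStateException("unavailable: " + addr);

            return "connected to " + addr;
        });

        System.out.println(conn); // prints "connected to node2:10800"
    }
}
```

    Because the list is walked in order, this also gives control over which node is tried first; shuffle the list before the call if you want to spread load across nodes. Note that a client created with a single address would not fail over to other nodes on later operations, so you may prefer to use this loop only to find a reachable node and then put that address first in the full address list.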