Unable to dynamically scale Ignite Pods in Kubernetes


We have been experimenting with the number of Ignite server pods to see the impact on performance.

One thing we have noticed is that if the number of Ignite server pods is increased after client nodes have established communication, the new pod goes into a crash loop with the error below.

If, however, the grid is destroyed (all client and server nodes are brought down) and the desired number of server nodes is then launched, there are no issues.

Even that procedure is not fully dependable for anything other than launching a single Ignite server.

From reading [this Stack Overflow post][1] and [this documentation][2], it looks like the issue may be that we are not launching the "Kubernetes service".

Ignite's KubernetesIPFinder requires users to configure and deploy a special Kubernetes service that maintains a list of the IP addresses of all the alive Ignite pods (nodes).

However, this is the only documentation I have found, and it says that it is no longer current.

Is this information still relevant for Ignite 2.11.1? If not, is there more recent documentation? If this service is indeed needed, are there more concrete examples and information on setting it up?

Error on new Server pod:

[21:37:55,793][SEVERE][main][IgniteKernal] Failed to start manager: GridManagerAdapter [enabled=true, name=o.a.i.i.managers.discovery.GridDiscoveryManager]
class org.apache.ignite.IgniteCheckedException: Failed to start SPI: TcpDiscoverySpi [addrRslvr=null, addressFilter=null, sockTimeout=5000, ackTimeout=5000, marsh=JdkMarshaller [clsFilter=org.apache.ignite.marshaller.MarshallerUtils$1@78422efb], reconCnt=10, reconDelay=2000, maxAckTimeout=600000, soLinger=0, forceSrvMode=false, clientReconnectDisabled=false, internalLsnr=null, skipAddrsRandomization=false]
    at org.apache.ignite.internal.managers.GridManagerAdapter.startSpi(GridManagerAdapter.java:281)
    at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.start(GridDiscoveryManager.java:980)
    at org.apache.ignite.internal.IgniteKernal.startManager(IgniteKernal.java:1985)
    at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1331)
    at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:2141)
    at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1787)
    at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1172)
    at org.apache.ignite.internal.IgnitionEx.startConfigurations(IgnitionEx.java:1066)
    at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:952)
    at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:851)
    at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:721)
    at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:690)
    at org.apache.ignite.Ignition.start(Ignition.java:353)
    at org.apache.ignite.startup.cmdline.CommandLineStartup.main(CommandLineStartup.java:367)
Caused by: class org.apache.ignite.spi.IgniteSpiException: Node with the same ID was found in node IDs history or existing node in topology has the same ID (fix configuration and restart local node) [localNode=TcpDiscoveryNode [id=000e84bb-f587-43a2-a662-c7c6147d2dde, consistentId=8751ef49-db25-4cf9-a38c-26e23a96a3e4, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, fd00:85:4001:5:f831:8cc:cd3:f863%eth0], sockAddrs=HashSet [nkw-mnomni-ignite-1-1-1.nkw-mnomni-ignite-1-1.680e5bbc-21b1-5d61-8dfa-6b27be10ede7.svc.cluster.local/fd00:85:4001:5:f831:8cc:cd3:f863:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=0, intOrder=0, lastExchangeTime=1676497065109, loc=true, ver=2.11.1#20211220-sha1:eae1147d, isClient=false], existingNode=000e84bb-f587-43a2-a662-c7c6147d2dde]
    at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.duplicateIdError(TcpDiscoverySpi.java:2083)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl.joinTopology(ServerImpl.java:1201)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl.spiStart(ServerImpl.java:473)
    at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.spiStart(TcpDiscoverySpi.java:2207)
    at org.apache.ignite.internal.managers.GridManagerAdapter.startSpi(GridManagerAdapter.java:278)
    ... 13 more

Server DiscoverySpi Config:

<property name="discoverySpi">
    <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
        <property name="ipFinder">
            <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder">
                <property name="namespace" value="myNameSpace"/>
                <property name="serviceName" value="myServiceName"/>
            </bean>
        </property>
    </bean>
</property>
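
For comparison, the Kubernetes deployment guides for Ignite 2.9 and later pass the namespace and service name to the IP finder through a `KubernetesConnectionConfiguration` bean given as a constructor argument. The fragment below is only a sketch of that form using the same placeholder names as above; it has not been verified against this cluster. It also assumes that a Kubernetes Service called `myServiceName` exists in `myNameSpace` and that the pod's service account is allowed to read that service's endpoints.

```xml
<property name="discoverySpi">
    <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
        <property name="ipFinder">
            <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder">
                <!-- Connection settings are supplied via the constructor rather than setters on the IP finder -->
                <constructor-arg>
                    <bean class="org.apache.ignite.kubernetes.configuration.KubernetesConnectionConfiguration">
                        <!-- Placeholder values taken from the server config above -->
                        <property name="namespace" value="myNameSpace"/>
                        <property name="serviceName" value="myServiceName"/>
                    </bean>
                </constructor-arg>
            </bean>
        </property>
    </bean>
</property>
```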

Client DiscoverySpi Configs:

<bean id="discoverySpi" class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
    <property name="ipFinder" ref="ipFinder" />
</bean>

<bean id="ipFinder" class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
    <property name="shared" value="false" />
    <property name="addresses">
        <list>
            <value>myServiceName.myNameSpace:47500</value>
        </list>
    </property>
</bean>

Edit:

I have experimented more with this issue. As long as I do not deploy any clients (using the static TcpDiscoveryVmIpFinder above), I am able to scale server pods up and down without any issue. However, as soon as a single client joins, I am no longer able to scale the server pods up.

I can see that the server pods have ports 47500 and 47100 open, so I am not sure what the issue is. Does the TcpDiscoveryKubernetesIpFinder still need the port to be specified in the client config?

I have tried changing my client config to use the TcpDiscoveryKubernetesIpFinder as shown below, but I am getting a discovery timeout failure (see the thread dump that follows the config).

    <property name="discoverySpi"> 
        <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi"> 
            <property name="ipFinder"> 
                <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder"> 
                    <property name="namespace" value="680e5bbc-21b1-5d61-8dfa-6b27be10ede7"/> 
                    <property name="serviceName" value="nkw-mnomni-ignite-1-1"/> 
                </bean> 
            </property> 
        </bean> 
    </property> 

24-Feb-2023 14:15:02.450 WARNING [grid-timeout-worker-#22%igniteClientInstance%] org.apache.ignite.logger.java.JavaLogger.warning Thread dump at 2023/02/24 14:15:02 UTC
Thread [name="main", id=1, state=WAITING, blockCnt=78, waitCnt=3]
    Lock [object=java.util.concurrent.CountDownLatch$Sync@45296dbd, ownerName=null, ownerId=-1]
        at java.base/jdk.internal.misc.Unsafe.park(Native Method)
        at java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:211)
        at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:715)
        at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1047)
        at java.base/java.util.concurrent.CountDownLatch.await(CountDownLatch.java:230)
        at o.a.i.spi.discovery.tcp.ClientImpl.spiStart(ClientImpl.java:324)
        at o.a.i.spi.discovery.tcp.TcpDiscoverySpi.spiStart(TcpDiscoverySpi.java:2207)
        at o.a.i.i.managers.GridManagerAdapter.startSpi(GridManagerAdapter.java:278)
        at o.a.i.i.managers.discovery.GridDiscoveryManager.start(GridDiscoveryManager.java:980)
        at o.a.i.i.IgniteKernal.startManager(IgniteKernal.java:1985)
        at o.a.i.i.IgniteKernal.start(IgniteKernal.java:1331)
        at o.a.i.i.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:2141)
        at o.a.i.i.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1787)
        - locked o.a.i.i.IgnitionEx$IgniteNamedInstance@57ac9100
        at o.a.i.i.IgnitionEx.start0(IgnitionEx.java:1172)
        at o.a.i.i.IgnitionEx.startConfigurations(IgnitionEx.java:1066)
        at o.a.i.i.IgnitionEx.start(IgnitionEx.java:952)
        at o.a.i.i.IgnitionEx.start(IgnitionEx.java:851)
        at o.a.i.i.IgnitionEx.start(IgnitionEx.java:721)
        at o.a.i.i.IgnitionEx.start(IgnitionEx.java:690)
        at o.a.i.Ignition.start(Ignition.java:353)

Edit 2: I also spoke with an admin about opening client side ports in case that was the issue. He indicated that should not be needed as clients should be able to open ephemeral ports to communicate with the server nodes.

[1]: Ignite not discoverable in kubernetes cluster with TcpDiscoveryKubernetesIpFinder
[2]: https://apacheignite.readme.io/docs/kubernetes-ip-finder


Solution

  • Removing the property <property name="shared" value="false" /> from my client config solved the problem in the end. I am still not certain why the TcpDiscoveryKubernetesIpFinder does not work on the client.
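
For reference, the client configuration from the question with only that one property removed would look roughly like this. The bean ids and the `myServiceName.myNameSpace:47500` address are the same placeholders used in the question, and nothing else is changed.

```xml
<bean id="discoverySpi" class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
    <property name="ipFinder" ref="ipFinder" />
</bean>

<bean id="ipFinder" class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
    <!-- The "shared" property has been removed; the address list is unchanged -->
    <property name="addresses">
        <list>
            <value>myServiceName.myNameSpace:47500</value>
        </list>
    </property>
</bean>
```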