java, hibernate, kubernetes, ignite

Ignite Cluster becomes unresponsive when relaunching client nodes


We are intermittently seeing the following error on our Kubernetes setup. The issue happens after we relaunch our Tomcat pod, which launches new Ignite client nodes.

I understand that the first stack trace shows Ignite detecting that the TCP communication SPI worker (tcp-comm-worker) has become unresponsive, but I do not see what this has to do with the second stack trace. They look like two unrelated errors, yet the second one is labeled as a thread dump taken at the same timestamp as the first: Thread dump at 2021/10/12 15:57:17

The issue can be resolved by bringing down all the Ignite pods and relaunching them, but a better understanding of the issue, and a way to avoid restarting Ignite, would be appreciated.

12-Oct-2021 15:57:17.139 WARNING [grid-timeout-worker-#134%igniteClientInstance%] org.apache.ignite.logger.java.JavaLogger.warning Possible failure suppressed accordingly to a configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=tcp-comm-worker, igniteInstanceName=igniteClientInstance, finished=false, heartbeatTs=1634054222218]]]
class org.apache.ignite.IgniteException: GridWorker [name=tcp-comm-worker, igniteInstanceName=igniteClientInstance, finished=false, heartbeatTs=1634054222218]
        at java.base/sun.nio.ch.Net.poll(Native Method)
        at java.base/sun.nio.ch.SocketChannelImpl.pollConnected(SocketChannelImpl.java:991)
        at java.base/sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:119)
        at org.apache.ignite.spi.communication.tcp.internal.GridNioServerWrapper.createNioSession(GridNioServerWrapper.java:465)
        at org.apache.ignite.spi.communication.tcp.internal.GridNioServerWrapper.createTcpClient(GridNioServerWrapper.java:691)
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:1255)
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$$Lambda$389/0x0000000012e5ffc0.apply(Unknown Source)
        at org.apache.ignite.spi.communication.tcp.internal.GridNioServerWrapper.createTcpClient(GridNioServerWrapper.java:689)
        at org.apache.ignite.spi.communication.tcp.internal.ConnectionClientPool.createCommunicationClient(ConnectionClientPool.java:453)
        at org.apache.ignite.spi.communication.tcp.internal.ConnectionClientPool.reserveClient(ConnectionClientPool.java:228)
        at org.apache.ignite.spi.communication.tcp.internal.CommunicationWorker.processDisconnect(CommunicationWorker.java:374)
        at org.apache.ignite.spi.communication.tcp.internal.CommunicationWorker.body(CommunicationWorker.java:174)
        at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$6.body(TcpCommunicationSpi.java:923)
        at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:58)
12-Oct-2021 15:57:17.141 WARNING [grid-timeout-worker-#134%igniteClientInstance%] org.apache.ignite.logger.java.JavaLogger.warning No deadlocked threads detected.
12-Oct-2021 15:57:17.170 WARNING [grid-timeout-worker-#134%igniteClientInstance%] org.apache.ignite.logger.java.JavaLogger.warning Thread dump at 2021/10/12 15:57:17 GMT
Thread [name="main", id=1, state=RUNNABLE, blockCnt=19, waitCnt=416]
        at java.base/java.net.SocketInputStream.socketRead0(Native Method)
        at java.base/java.net.SocketInputStream.socketRead(SocketInputStream.java:115)
        at java.base/java.net.SocketInputStream.read(SocketInputStream.java:168)
        at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
        at java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:252)
        at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:271)
        - locked java.io.BufferedInputStream@263909ea
        at org.postgresql.core.PGStream.ReceiveChar(PGStream.java:256)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1163)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:188)
        - locked org.postgresql.core.v3.QueryExecutorImpl@1b338a37
        at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:437)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:353)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:257)
        at com.mchange.v2.c3p0.impl.NewProxyPreparedStatement.executeQuery(NewProxyPreparedStatement.java:116)
        at org.hibernate.engine.jdbc.internal.ResultSetReturnImpl.extract(ResultSetReturnImpl.java:70)
        at org.hibernate.loader.Loader.getResultSet(Loader.java:2123)
        at org.hibernate.loader.Loader.executeQueryStatement(Loader.java:1911)
        at org.hibernate.loader.Loader.executeQueryStatement(Loader.java:1887)
        at org.hibernate.loader.Loader.doQuery(Loader.java:932)
        at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:349)
        at org.hibernate.loader.Loader.doList(Loader.java:2615)
        at org.hibernate.loader.Loader.doList(Loader.java:2598)
        at org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2430)
        at org.hibernate.loader.Loader.list(Loader.java:2425)
        at org.hibernate.loader.hql.QueryLoader.list(QueryLoader.java:502)
        at org.hibernate.hql.internal.ast.QueryTranslatorImpl.list(QueryTranslatorImpl.java:370)
        at org.hibernate.engine.query.spi.HQLQueryPlan.performList(HQLQueryPlan.java:216)
        at org.hibernate.internal.SessionImpl.list(SessionImpl.java:1481)
        at org.hibernate.query.internal.AbstractProducedQuery.doList(AbstractProducedQuery.java:1441)
        at org.hibernate.query.internal.AbstractProducedQuery.list(AbstractProducedQuery.java:1410)
        at org.hibernate.Query.getResultList(Query.java:427)
        at com.foo.dao.hibernate.report.FooBarImpl.retrieveFoo(FooBarImpl.java:61)
        at jdk.internal.reflect.GeneratedMethodAccessor513.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

Solution

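  • When Ignite reports a failure through its FailureHandler, it writes a thread dump of all threads to the log for later analysis. Your second stack trace looks like part of that thread dump (the "main" thread, which happened to be waiting on a PostgreSQL query at that moment), not a separate error.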
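
To illustrate why an unrelated thread such as "main" shows up next to the tcp-comm-worker warning, here is a minimal sketch using only the JDK's ThreadMXBean. It is not Ignite's actual implementation; the class name ThreadDumpSketch and the output format are assumptions chosen to resemble the log above. The point is that a JVM-wide thread dump lists every live thread, whatever it is doing, and that a deadlock check is a separate step (compare the "No deadlocked threads detected." line).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper for illustration only; not part of Ignite.
class ThreadDumpSketch {

    /** Formats every live JVM thread, not only the one that looked blocked. */
    static String dumpAllThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // false, false: skip locked-monitor and locked-synchronizer details.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info == null) {
                continue; // defensive: skip a thread that is no longer reported
            }
            sb.append("Thread [name=\"").append(info.getThreadName())
              .append("\", id=").append(info.getThreadId())
              .append(", state=").append(info.getThreadState())
              .append("]\n");
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("        at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    /** Returns true if the JVM currently reports deadlocked threads. */
    static boolean hasDeadlock() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.isSynchronizerUsageSupported()
                ? mx.findDeadlockedThreads()
                : mx.findMonitorDeadlockedThreads();
        return ids != null && ids.length > 0;
    }

    public static void main(String[] args) {
        System.out.println(hasDeadlock()
                ? "Deadlocked threads detected."
                : "No deadlocked threads detected.");
        System.out.print(dumpAllThreads());
    }
}
```

Running this in any application prints the calling thread along with every other live thread, which is why a Hibernate/JDBC stack can appear in a dump that was triggered by a communication worker.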