java, ignite, gridgain

Apache Ignite Client restart scenario


This is the scenario:

  1. I started the server node.
  2. I started a client Ignite node from a Java application, call it "X".
  3. In Visor, the "node" command showed two nodes: one client and one server.
  4. I killed the Java app "X" with "kill -9 pid".
  5. When I now enter "node" in the Visor terminal, it still lists both the client and the server node. Asking for the client node's details throws an error, as expected.
  6. When I restart the Java app "X", its code again tries to connect to the Ignite server. Instead of connecting, it prints this log line many times:

"org.apache.ignite.logger.java.JavaLogger" "info" "INFO" "" "284" "Accepted incoming communication connection [locAddr=/0:0:0:0:0:0:0:1:47101, rmtAddr=/0:0:0:0:0:0:0:1:62856]" "" "" "" "" "" "" "1587013526124" "" "" "" "" "" "" "ROOT" "{""service"":"""",""logger_name"":""org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi""}"

  7. The client never connects, so the Java code does not continue past the connection attempt and the application does not resume. I also found this in the Ignite server log:

[10:37:57] Possible failure suppressed accordingly to a configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_CRITICAL_OPERATION_TIMEOUT, err=class o.a.i.IgniteException: Checkpoint read lock acquisition has been timed out.]]
[10:37:57,739][SEVERE][exchange-worker-#46][GridCacheDatabaseSharedManager] Checkpoint read lock acquisition has been timed out.
class org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$CheckpointReadLockTimeoutException: Checkpoint read lock acquisition has been timed out.
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.failCheckpointReadLock(GridCacheDatabaseSharedManager.java:1708)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.checkpointReadLock(GridCacheDatabaseSharedManager.java:1640)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.initTopologies(GridDhtPartitionsExchangeFuture.java:1078)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:944)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:3258)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:3104)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:119)
    at java.lang.Thread.run(Thread.java:748)
[10:39:21,547][SEVERE][tcp-disco-msg-worker-[693d29cd 0:0:0:0:0:0:0:1%lo0:47501 crd]-#2][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [workerName=db-checkpoint-thread, threadName=db-checkpoint-thread-#59, blockedFor=209s]

My assumption is that, because I force-kill the Java application that starts the Ignite client node, some kind of topology imbalance may result.

Can someone please suggest: if I force-kill the client application, is there a correct way to restart it so that it re-establishes the connection with the Ignite server and continues working?


Solution

  • This scenario is possible when you have very long timeouts.

    You should not expect the killed node to be dropped, and a new one to join, before all the relevant timeouts have run out, such as the network timeout, socket write timeout, and failure detection timeout. That is, unless you do a graceful shutdown.
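
    A minimal sketch of the graceful-shutdown side of this, under stated assumptions: the class name GracefulClientShutdown and the helper closeOnShutdown are hypothetical, and the client is modelled as a plain AutoCloseable stand-in so the example runs without Ignite on the classpath. Since org.apache.ignite.Ignite extends AutoCloseable, the instance returned by Ignition.start(...) could be passed in its place.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class GracefulClientShutdown {

    /**
     * Registers a JVM shutdown hook that closes the given client at most once.
     * 'client' is a stand-in here; a started Ignite client node could be
     * passed instead, because Ignite is AutoCloseable.
     */
    public static Thread closeOnShutdown(AutoCloseable client, AtomicBoolean closed) {
        Thread hook = new Thread(() -> {
            // compareAndSet guards against closing the client twice.
            if (closed.compareAndSet(false, true)) {
                try {
                    client.close();
                } catch (Exception e) {
                    System.err.println("Client close failed: " + e);
                }
            }
        }, "client-shutdown-hook");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) {
        AtomicBoolean closed = new AtomicBoolean(false);
        // Hypothetical stand-in for: Ignite ignite = Ignition.start(cfg);
        AutoCloseable client = () -> System.out.println("client closed");
        closeOnShutdown(client, closed);
        System.out.println("application running");
        // The hook runs on normal exit or on SIGTERM (plain "kill pid").
        // It does NOT run on SIGKILL ("kill -9 pid").
    }
}
```

    With a hook like this, stopping the application with a plain "kill pid" lets the client leave the cluster cleanly, so the server does not have to wait for timeouts to notice it is gone. "kill -9" bypasses shutdown hooks entirely, so in that case the only lever is the timeouts themselves, for example IgniteConfiguration.setFailureDetectionTimeout and setClientFailureDetectionTimeout; suitable values depend on your environment.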