
JVM will be halted immediately due to the failure: SYSTEM_WORKER_TERMINATION - Failed to find security context for subject with given ID


I'm observing strange failures of our Apache Ignite cluster. The Ignite version is 2.13.0, running with "OpenJDK 64-Bit Server VM Zulu11.52+13-CA (build 11.0.13+8-LTS, mixed mode)".

The cluster consists of 3 baseline server nodes with persistence enabled. Once a week each of them is rebooted after receiving automatic operating system updates.

On March 28 at 19:03 server number 2 was rebooted and came back up; 10 minutes later the Ignite service crashed with the following error after logging a thread dump:

[19:13:17,302][SEVERE][sys-stripe-5-#6][] JVM will be halted immediately due to the failure: [failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Failed to find security context for subject with given ID : 74f979dd-342d-4055-bda4-58535b8ab282]]

At the same time, the Ignite service on server number 3 crashed the same way:

[19:13:17,606][SEVERE][sys-stripe-5-#6][] JVM will be halted immediately due to the failure: [failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Failed to find security context for subject with given ID : 74f979dd-342d-4055-bda4-58535b8ab282]]

At this point only one Ignite server node remained active.

One day later, server number 3 was automatically rebooted by the maintenance job, and when it came back up Ignite rejoined the cluster without crashing.

Logs server 2:
https://pastebin.com/XJK8kPyU
Logs server 3:
https://pastebin.com/YrWvgL1K

Any idea why Ignite just randomly crashes?

Regards, Sven

Update:

These failures caused the remaining node, and even the whole cluster after a restart of the failed nodes, to go into an unusable state. Our application was unable to work with Ignite and showed the following failure:

Caused by: org.apache.ignite.IgniteException: Failed to execute the cache operation (all partition owners have left the grid, partition data has been lost) [cacheName=datastructures_ATOMIC_PARTITIONED_1@default-ds-group, partition=299, key=GridCacheQueueItemKey [queueId=d6743312681-6ddd800e-39e6-473f-915a-39c576ef32be, queueName=cluster_processing_6-queue, idx=1322]]

It seems the internal cache was broken because of lost partitions caused by the node failures. The only way we found to get the cluster and our application running correctly again was to fully shut down all Ignite servers and clients, then restart everything. I was not able to reset lost partitions on that cache. Is it possible to configure that system cache to have 2 backups instead of just 1? This would prevent the situation if 2 nodes fail at the same time again.
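For what it's worth, the backing cache of an IgniteQueue (the `datastructures_ATOMIC_PARTITIONED_1@default-ds-group` cache in the error above) gets its backup count from the CollectionConfiguration passed when the queue is first created, not from a global system-cache setting. A sketch of creating a queue with 2 backups might look like the following; the queue name is taken from the error message above, everything else is illustrative, and note that the settings only take effect when the backing cache is created, so an existing queue would have to be dropped and recreated:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteQueue;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheAtomicityMode;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CollectionConfiguration;

public class QueueBackupsExample {
    public static void main(String[] args) {
        // Backups for a data-structure cache are set per collection
        // at creation time via CollectionConfiguration.
        CollectionConfiguration colCfg = new CollectionConfiguration();
        colCfg.setCacheMode(CacheMode.PARTITIONED);
        colCfg.setAtomicityMode(CacheAtomicityMode.ATOMIC);
        colCfg.setBackups(2); // tolerate simultaneous loss of two nodes

        try (Ignite ignite = Ignition.start()) {
            // The backing cache is created on first access with these
            // settings; an already-existing queue keeps its old config.
            IgniteQueue<String> queue =
                ignite.queue("cluster_processing_6-queue", 0 /* unbounded */, colCfg);
            queue.add("item");
        }
    }
}
```

As for the failed reset: the usual recovery path for lost partitions is the control script, e.g. `control.sh --cache reset_lost_partitions <cacheName>`, but whether it accepts the internal data-structures cache name may depend on the Ignite version.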

This behavior breaks the idea of having a highly available cluster.


Solution

  • The observed node failures are related to a known issue. With security enabled, there is a chance that a server node receives a communication message from a joining node before the topology snapshot has been updated, which leads to the server node failure.

    According to the provided logs, in your particular case, it was related to the join of the client node with id=74f979dd-342d-4055-bda4-58535b8ab282:

    [19:13:16,989][SEVERE][sys-stripe-5-#6][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Failed to find security context for subject with given ID : 74f979dd-342d-4055-bda4-58535b8ab282]]
    ...
    [19:13:16,995][INFO][disco-event-worker-#62][GridDiscoveryManager] Added new node to topology: TcpDiscoveryNode [id=74f979dd-342d-4055-bda4-58535b8ab282, consistentId=74f979dd-342d-4055-bda4-58535b8ab282, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 10.105.178.172, 127.0.0.1], sockAddrs=HashSet [b2bimcpapp2.internal.domain/10.105.178.172:0, 0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0], discPort=0, order=1265, intOrder=647, lastExchangeTime=1680023596965, loc=false, ver=2.13.0#20220420-sha1:551f6ece, isClient=true]