Apache Ignite 2.14: Getting "partition data has been lost" error for ignite-sys-atomic-cache

I have an Apache Ignite 2.14 cluster of 3 nodes running on Kubernetes. All my caches have one backup copy.

After enabling persistence on the default data region a couple of months ago, I started getting the exception CacheInvalidStateException: Failed to execute the cache operation (all partition owners have left the grid, partition data has been lost) whenever one or two nodes restarted, whether because of a deployment or for some other reason.

It was worrying, but I learned to fix it by running control.sh --cache reset_lost_partitions cacheName.

This time, after two nodes restarted due to a transient failure, I started getting an error that I couldn't fix by running that command:

Caused by: class org.apache.ignite.internal.processors.cache.CacheInvalidStateException: Failed to execute the cache operation (all partition owners have left the grid, partition data has been lost) [cacheName=ignite-sys-atomic-cache@default-ds-group, partition=985, key=UserKeyCacheObjectImpl [part=985, val=GridCacheInternalKeyImpl [name=alias, grpName=default-ds-group], hasValBytes=true]]
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTopologyFutureAdapter.validateKey(GridDhtTopologyFutureAdapter.java:214)

It looks like this time the issue involves a system cache, ignite-sys-atomic-cache@default-ds-group. I guess it is related to the AtomicSequence object that I use in the application to generate IDs. The error occurs exactly when I try to use AtomicLong.

The questions are:

Why might this happen?
Is it possible to fix it without destroying the cluster and reloading all the data from scratch (it would take a day or two)?
How can I prevent similar issues in the future?

Thank you in advance!

P.S. On GridGain Portal the following error is reported: Cache [default-ds-group] has zero partition copies.

Solution

To fix it you can run:

control.sh --cache reset_lost_partitions default-ds-group,default-volatile-ds-group@volatileDsMemPlc

Some system caches are partitioned and can lose partitions just like regular user caches.

The command above should help in your case.

As a workaround, you can change the number of backups and the group name used for the atomic data structures:

https://www.gridgain.com/sdk/latest/javadoc/org/apache/ignite/configuration/AtomicConfiguration.html#setBackups-int-
https://www.gridgain.com/sdk/latest/javadoc/org/apache/ignite/configuration/AtomicConfiguration.html#setGroupName-java.lang.String-

If the structure is volatile, it will have the group name "default-volatile-ds-group". Otherwise, if no group name is given, the name will be "default-ds-group". As far as I know, the cache that backs the data structure is created based on this group name.
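To make the naming rule concrete, here is a small stand-alone sketch. It is not Ignite code and does not call any Ignite API: the class DsGroupNames and its methods are hypothetical helpers, and the "ignite-sys-atomic-cache@<group>" pattern is only inferred from the cache name in the error message above, so treat it as an assumption rather than a documented contract.

```java
/** Illustrative model of the group naming rule described above; not part of Ignite. */
public class DsGroupNames {
    // Names taken from the answer text and the error message in the question.
    static final String DEFAULT_GROUP = "default-ds-group";
    static final String DEFAULT_VOLATILE_GROUP = "default-volatile-ds-group";
    static final String ATOMICS_CACHE_PREFIX = "ignite-sys-atomic-cache";

    /** Volatile structures use the volatile group; otherwise the configured group or the default. */
    static String groupName(String configuredGroup, boolean isVolatile) {
        if (isVolatile) {
            return DEFAULT_VOLATILE_GROUP;
        }
        return configuredGroup == null ? DEFAULT_GROUP : configuredGroup;
    }

    /** Assumed pattern, inferred from "ignite-sys-atomic-cache@default-ds-group" in the stack trace. */
    static String atomicsCacheName(String group) {
        return ATOMICS_CACHE_PREFIX + "@" + group;
    }

    /** Builds the control.sh command line for a comma-separated list of names. */
    static String resetCommand(String... names) {
        return "control.sh --cache reset_lost_partitions " + String.join(",", names);
    }

    public static void main(String[] args) {
        String group = groupName(null, false);
        System.out.println(atomicsCacheName(group));
        System.out.println(resetCommand(group));
    }
}
```

With no group configured and a non-volatile structure, the sketch yields the same cache name that appears in the exception, which is why the reset command targets default-ds-group rather than a user cache name.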

Try the following example for your data structure:

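    // Use a dedicated group instead of the default "default-ds-group".
    AtomicConfiguration cfg = new AtomicConfiguration().setGroupName("testgrp");

    cfg.setBackups(1);
    cfg.setCacheMode(CacheMode.PARTITIONED);

    // Arguments: name, configuration, initial value, create if it does not exist.
    IgniteAtomicReference<String> ref = ignite.atomicReference("ref", cfg, "d", true);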

Regards, Andrei