
Ignite cluster repeatedly failing


Our Ignite server is showing the following error.

We are not performing any queries or Ignite compute tasks; we are just using the put and get features of the caches.

We do have persistence enabled.
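For reference, our access pattern is essentially the following; the cache name and value type here are illustrative placeholders rather than our actual code:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cluster.ClusterState;
    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class PutGetUsage {
        public static void main(String[] args) {
            // Native persistence enabled on the default data region.
            DataStorageConfiguration storageCfg = new DataStorageConfiguration();
            storageCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);

            IgniteConfiguration cfg = new IgniteConfiguration()
                .setDataStorageConfiguration(storageCfg);

            try (Ignite ignite = Ignition.start(cfg)) {
                // With persistence, the cluster must be activated before caches can be used.
                ignite.cluster().state(ClusterState.ACTIVE);

                IgniteCache<String, byte[]> cache = ignite.getOrCreateCache("exampleCache");

                // Plain put/get only -- no SQL queries, no compute tasks.
                cache.put("someKey", new byte[]{1, 2, 3});
                byte[] value = cache.get("someKey");
            }
        }
    }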

We did connect with Ignite Visor, and it shows nothing anomalous in the cluster topology or in any of the caches.

Any insight into this error would be appreciated.

[15:01:03,943][SEVERE][sys-stripe-4-#5][FailureProcessor] A critical problem with persistence data structures was detected. Please make backup of persistence storage and WAL files for further analysis. Persistence storage path: null WAL path: /ignite/wal WAL archive path: /ignite/walarchive

[15:01:03,946][SEVERE][sys-stripe-4-#5][FailureProcessor] No deadlocked threads detected.
....

[15:01:04,135][SEVERE][sys-stripe-4-#5][] JVM will be halted immediately due to the failure: [failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.i.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is corrupted [pages(groupId, pageId)=[IgniteBiTuple [val1=-2113909708, val2=1125985806188550]], msg=Runtime failure on search row: SearchRow [key=KeyCacheObjectImpl [part=20, val=2AC8D168CC86654C1879E6999D1CCFD7, hasValBytes=true], hash=741925420, cacheId=0]]

Edit: server-side stack trace:

[18:40:50,919][SEVERE][sys-stripe-4-#5][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.i.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is corrupted [pages(groupId, pageId)=[IgniteBiTuple [val1=-2113909708, val2=1125985806188550]], msg=Runtime failure on search row: SearchRow [key=KeyCacheObjectImpl [part=20, val=2AC8D168CC86654C1879E6999D1CCFD7, hasValBytes=true], hash=741925420, cacheId=0]]]]
class org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is corrupted [pages(groupId, pageId)=[IgniteBiTuple [val1=-2113909708, val2=1125985806188550]], msg=Runtime failure on search row: SearchRow [key=KeyCacheObjectImpl [part=20, val=2AC8D168CC86654C1879E6999D1CCFD7, hasValBytes=true], hash=741925420, cacheId=0]]
    at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.corruptedTreeException(BPlusTree.java:6139)
    at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invoke(BPlusTree.java:1953)
    at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke0(IgniteCacheOffheapManagerImpl.java:1765)
    at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke(IgniteCacheOffheapManagerImpl.java:1748)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.invoke(GridCacheOffheapManager.java:2794)
    at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.invoke(IgniteCacheOffheapManagerImpl.java:441)
    at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.innerUpdate(GridCacheMapEntry.java:2342)
    at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateSingle(GridDhtAtomicCache.java:2589)
    at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.update(GridDhtAtomicCache.java:2049)
    at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1866)
    at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1725)
    at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3228)
    at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.access$400(GridDhtAtomicCache.java:143)
    at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$5.apply(GridDhtAtomicCache.java:284)
    at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$5.apply(GridDhtAtomicCache.java:279)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1151)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:592)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:393)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:319)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:110)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:309)
    at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1908)
    at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1529)
    at org.apache.ignite.internal.managers.communication.GridIoManager.access$5300(GridIoManager.java:242)
    at org.apache.ignite.internal.managers.communication.GridIoManager$9.execute(GridIoManager.java:1422)
    at org.apache.ignite.internal.managers.communication.TraceRunnable.run(TraceRunnable.java:55)
    at org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:569)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Record is too long [capacity=134217728, size=134219738]
    at org.apache.ignite.internal.processors.cache.persistence.wal.SegmentedRingByteBuffer.offer0(SegmentedRingByteBuffer.java:214)
    at org.apache.ignite.internal.processors.cache.persistence.wal.SegmentedRingByteBuffer.offer(SegmentedRingByteBuffer.java:193)
    at org.apache.ignite.internal.processors.cache.persistence.wal.filehandle.FileWriteHandleImpl.addRecord(FileWriteHandleImpl.java:243)
    at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.log(FileWriteAheadLogManager.java:858)
    at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.log(FileWriteAheadLogManager.java:811)
    at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.logUpdate(GridCacheMapEntry.java:4338)
    at org.apache.ignite.internal.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.update(GridCacheMapEntry.java:6492)
    at org.apache.ignite.internal.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.call(GridCacheMapEntry.java:6244)
    at org.apache.ignite.internal.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.call(GridCacheMapEntry.java:5928)
    at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Invoke.invokeClosure(BPlusTree.java:4021)
    at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Invoke.access$5700(BPlusTree.java:3915)
    at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invokeDown(BPlusTree.java:2042)
    at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invoke(BPlusTree.java:1920)
    ... 27 more

Solution

  • The issue is caused by inserting a record whose serialized size is larger than the actual WAL buffer size.

    Ignite uses a WAL buffer to store serialized WAL records before writing them to the WAL file. Because of that, each WAL record must be smaller than the actual size of the WAL buffer.

    By default, the WAL buffer size is equal to the configured WAL segment size. However, if IGNITE_WAL_MMAP is disabled, the WAL buffer size is instead limited by the WAL buffer size property, which defaults to WAL segment size / 4.

    As a workaround, you can try increasing the WAL buffer size using the properties mentioned above; a configuration sketch is shown below. More details regarding that can be found here.
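A minimal configuration sketch of that workaround follows. The sizes are illustrative: according to the failure above, the record needed 134219738 bytes while the buffer capacity was 134217728 bytes (128 MB), so whatever values you choose must comfortably exceed your largest record.

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class WalBufferSizing {
        public static void main(String[] args) {
            DataStorageConfiguration storageCfg = new DataStorageConfiguration();
            storageCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);

            // With IGNITE_WAL_MMAP enabled (the default), the WAL buffer spans the whole
            // segment, so raising the segment size also raises the effective buffer size.
            storageCfg.setWalSegmentSize(256 * 1024 * 1024);

            // With -DIGNITE_WAL_MMAP=false, this property limits the buffer instead
            // (its default is WAL segment size / 4).
            storageCfg.setWalBufferSize(256 * 1024 * 1024);

            IgniteConfiguration cfg = new IgniteConfiguration()
                .setDataStorageConfiguration(storageCfg);

            try (Ignite ignite = Ignition.start(cfg)) {
                // ...
            }
        }
    }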