Ignite node crashed [ttl-cleanup-worker]


I have Ignite 2.7 and a 5-node cluster. Over 40 million records are generated and stored in an Ignite cache, with a 3-day expiry set. Today one of the Ignite nodes halted with the error below. Please help me identify and resolve the issue.

[2019-09-11 07:45:59,570][ERROR][ttl-cleanup-worker-#170][root] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Unknown page type: 1 pageId: 000102210006d4ac]]
java.lang.IllegalStateException: Unknown page type: 1 pageId: 000102210006d4ac
    at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.io(BPlusTree.java:5058)
    at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.access$200(BPlusTree.java:90)
    at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$AbstractForwardCursor.nextPage(BPlusTree.java:5330)
    at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$ForwardCursor.next(BPlusTree.java:5566)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpiredInternal(GridCacheOffheapManager.java:2232)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpired(GridCacheOffheapManager.java:2157)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.expire(GridCacheOffheapManager.java:845)
    at org.apache.ignite.internal.processors.cache.GridCacheTtlManager.expire(GridCacheTtlManager.java:207)
    at org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.body(GridCacheSharedTtlCleanupManager.java:139)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
    at java.lang.Thread.run(Thread.java:748)
[2019-09-11 07:45:59,575][WARN ][ttl-cleanup-worker-#170][FailureProcessor] No deadlocked threads detected.
[2019-09-11 07:46:40,831][WARN ][jvm-pause-detector-worker][IgniteKernal] Possible too long JVM pause: 41233 milliseconds.
[2019-09-11 07:46:40,831][ERROR][sys-stripe-0-#1][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=grid-nio-worker-tcp-comm-23, blockedFor=41s]
[2019-09-11 07:46:40,832][WARN ][sys-stripe-0-#1][G] Thread [name="grid-nio-worker-tcp-comm-23-#143", id=173, state=RUNNABLE, blockCnt=0, waitCnt=0]

The Ignite configuration is:

<?xml version="1.0" encoding="UTF-8"?>

<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
        http://www.springframework.org/schema/beans/spring-beans.xsd">
    <bean id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">

        <!-- Enable native persistence -->
        <property name="dataStorageConfiguration">
            <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
                <property name="metricsEnabled" value="true"/>
                <property name="defaultDataRegionConfiguration">
                    <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                        <property name="persistenceEnabled" value="true"/>
                    </bean>
                </property>
                <property name="storagePath" value="/ignite_data/ignite/persistance"/>
                <property name="walPath" value="/ignite_data/ignite/wal"/>
                <property name="walArchivePath" value="/data/disk01/ignite/archive"/>
            </bean>
        </property>

        <!-- Enable authentication for Ignite -->
        <property name="authenticationEnabled" value="true"/>


        <!-- Enabling expiry policy -->
        <property name="cacheConfiguration">
            <list>
                <bean class="org.apache.ignite.configuration.CacheConfiguration">
                    <property name="name" value="CACHE_L4_TRIGGER_NOTIFICATION"/>
                    <property name="expiryPolicyFactory">
                        <bean class="javax.cache.expiry.CreatedExpiryPolicy" factory-method="factoryOf">
                            <constructor-arg>
                                <bean class="javax.cache.expiry.Duration">
                                    <constructor-arg value="DAYS"/>
                                    <constructor-arg value="3"/>
                                </bean>
                            </constructor-arg>
                        </bean>
                    </property>
                </bean>
            </list>
        </property>


        <!-- Log Ignite metrics every 10 minutes -->
        <property name="gridLogger">
            <bean class="org.apache.ignite.logger.log4j.Log4JLogger">
                <constructor-arg type="java.lang.String" value="/home/trigger_be/apache-ignite-2.7.0/config/log4j.xml"/>
            </bean>
        </property>
        <property name="metricsLogFrequency" value="#{60 * 10 * 1000}"/>

        <!-- Define the cluster by listing node IPs -->
        <property name="discoverySpi">
            <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
                <property name="ipFinder">
                    <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.multicast.TcpDiscoveryMulticastIpFinder">
                        <property name="addresses">
                            <list>
                                <value>172.16.5.36:49500..49509</value>
                                <value>172.16.5.37:49500..49509</value>
                                <value>172.16.5.38:49500..49509</value>
                                <value>172.16.5.39:49500..49509</value>
                                <value>172.16.5.40:49500..49509</value>
                            </list>
                        </property>
                    </bean>
                </property>
            </bean>
        </property>
    </bean>
</beans>


Solution

  • This looks like a data corruption issue. The recommended fix is to remove all persistence data from this node and then re-add the node to the cluster's baseline topology. The data will then be rebalanced onto it, provided you have enough backups.

    This also resembles the issue IGNITE-10767. Do you have MVCC (transactional SQL, TRANSACTIONAL_SNAPSHOT caches) enabled?
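
    On the "enough backups" condition: the cache configuration in the question does not set the backups property, and for partitioned caches Ignite defaults to 0 backups, so wiping one node's persistence directory would likely lose the partitions that node owned. A minimal sketch of a cache definition with one backup copy follows. The value 1 is an assumption for illustration, not something stated in the question, and for a cache that already exists on disk the stored configuration may take precedence, so this may only take effect when the cache is created.

```xml
<bean class="org.apache.ignite.configuration.CacheConfiguration">
    <property name="name" value="CACHE_L4_TRIGGER_NOTIFICATION"/>

    <!-- Assumption: keep one backup copy of every partition on another node.
         Choose a value that fits your memory and disk capacity. -->
    <property name="backups" value="1"/>

    <!-- The expiryPolicyFactory property from the question stays unchanged here. -->
</bean>
```

    Each backup copy costs extra memory and disk on the other nodes, so the trade-off is storage against the ability to rebuild a node after this kind of corruption.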