We use Ignite 2.7.6 in both server and client modes: two server and six clients.
At first, each app node with client Ignite inside had 2G heap. Each Ignite server node had 24G offheap and 2G heap.
With last app update we introduced new functionality which required about 2000 caches of 20 entires (user groups). Cache entry has small size up to 10 integers inside.
These caches are created via ignite.getOrCreateCache(name)
method, so they have default cache configurations (off-heap, partitioned).
But in an hour after update we got OOM error on a server node:
[00:59:55,628][SEVERE][sys-#44759][GridDhtPartitionsExchangeFuture] Failed to notify listener: o.a.i.i.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$2@3287dcbd
java.lang.OutOfMemoryError: Java heap space
Heaps are increased now to 16G on Ignite server nodes and to 12G on app nodes.
As we can see, all server nodes have high CPU load about 250% now (20% before update) and long G1 Young Gen pauses up to 5 millisecond (300 microseconds before update).
Server config is:
<beans xmlns="http://www.springframework.org/schema/beans" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation=" http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd">
<bean id="grid.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
<property name="workDirectory" value="/opt/qwerty/ignite/data"/>
<property name="gridLogger">
<bean class="org.apache.ignite.logger.log4j2.Log4J2Logger">
<constructor-arg type="java.lang.String" value="config/ignite-log4j2.xml"/>
</bean>
</property>
<property name="dataStorageConfiguration">
<bean class="org.apache.ignite.configuration.DataStorageConfiguration">
<property name="defaultDataRegionConfiguration">
<bean class="org.apache.ignite.configuration.DataRegionConfiguration">
<property name="maxSize" value="#{24L * 1024 * 1024 * 1024}"/>
<property name="pageEvictionMode" value="RANDOM_LRU"/>
</bean>
</property>
</bean>
</property>
<property name="discoverySpi">
<bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
<property name="localAddress" value="host-1.qwerty.srv"/>
<property name="ipFinder">
<bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
<property name="addresses">
<list>
<value>host-1.qwerty.srv:47500</value>
<value>host-2.qwerty.srv:47500</value>
</list>
</property>
</bean>
</property>
</bean>
</property>
<property name="communicationSpi">
<bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
<property name="localAddress" value="host-1.qwerty.srv"/>
</bean>
</property>
</bean>
</beans>
In memory dump of an Ignite server node we see a lot of org.apache.ignite.internal.marshaller.optimized.OptimizedObjectStreamRegistry$StreamHolder
of 21Mb
Memory leak report shows:
Problem Suspect 1
One instance of "org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager" loaded by "jdk.internal.loader.ClassLoaders$AppClassLoader @ 0x400000100" occupies 529 414 776 (10,39 %) bytes. The memory is accumulated in one instance of "java.util.LinkedList" loaded by "<system class loader>".
Keywords
jdk.internal.loader.ClassLoaders$AppClassLoader @ 0x400000100
java.util.LinkedList
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager
Problem Suspect 2
384 instances of "org.apache.ignite.thread.IgniteThread", loaded by "jdk.internal.loader.ClassLoaders$AppClassLoader @ 0x400000100" occupy 3 023 380 000 (59,34 %) bytes.
Keywords
org.apache.ignite.thread.IgniteThread
jdk.internal.loader.ClassLoaders$AppClassLoader @ 0x400000100
Problem Suspect 3
1 023 instances of "org.apache.ignite.internal.processors.cache.CacheGroupContext", loaded by "jdk.internal.loader.ClassLoaders$AppClassLoader @ 0x400000100" occupy 905 077 824 (17,76 %) bytes.
Keywords
jdk.internal.loader.ClassLoaders$AppClassLoader @ 0x400000100
org.apache.ignite.internal.processors.cache.CacheGroupContext
The question is what's wrong we have done? What can we tune? Maybe the problem in our code, but how to identify where it is?
2000 caches is a lot. One cache probably takes up to 40M in data structures.
I recommend at least using the same cacheGroup
for all caches of the similar purpose and composition, to share some of these data structures.