We are using Pivotal GemFire as a cache for our data. We recently migrated from GemFire 8.2.1 to 9.5.1 with exactly the same regions, data, and indexes. However, index creation on one region in particular, which has an entry count of 7,284,500, is taking far too long. We use Spring Data GemFire v2.4.1.RELEASE to define the cache server. Below is the configuration of the problematic region:
<gfe:replicated-region id="someRegion"
shortcut="REPLICATE_PERSISTENT" concurrency-level="100"
persistent="true" disk-synchronous="true" statistics="true">
<gfe:eviction action="OVERFLOW_TO_DISK" type="ENTRY_COUNT"
threshold="1000"></gfe:eviction>
</gfe:replicated-region>
Below are the index definitions:
<gfe:index id="someRegion_idx1" expression="o1.var1" from="/someRegion o1" />
<gfe:index id="someRegion_idx2" expression="o2.var2" from="/someRegion o2"/>
<gfe:index id="someRegion_idx3" expression="o3.var3" from="/someRegion o3"/>
<gfe:index id="someRegion_idx4" expression="o4.var4" from="/someRegion o4"/>
<gfe:index id="someRegion_idx5" expression="o5.var5" from="/someRegion o5"/>
<gfe:index id="someRegion_idx6" expression="o6.var6" from="/someRegion o6"/>
<gfe:index id="someRegion_idx7" expression="o7.var7" from="/someRegion o7"/>
<gfe:index id="someRegion_idx8" expression="o8.var8" from="/someRegion o8"/>
Below is the cache definition:
<gfe:cache
properties-ref="gemfireProperties"
close="true"
critical-heap-percentage="85"
eviction-heap-percentage="75"
pdx-serializer-ref="pdxSerializer"
pdx-persistent="true"
pdx-read-serialized="true"
pdx-ignore-unread-fields="false" />
Below are the Java parameters:
java -Xms50G -Xmx80G -XX:+UseConcMarkSweepGC \
-XX:+UseCMSInitiatingOccupancyOnly \
-XX:CMSInitiatingOccupancyFraction=70 \
-XX:+ScavengeBeforeFullGC -XX:+CMSScavengeBeforeRemark \
-XX:+UseParNewGC -XX:+UseLargePages \
-XX:+DisableExplicitGC \
-Ddw.appname=$APPNAME \
-Dgemfire.Query.VERBOSE=true \
-Dgemfire.QueryService.allowUntrustedMethodInvocation=true \
-DDistributionManager.MAX_THREADS=20 \
-DDistributionManager.MAX_FE_THREADS=10 \
-Dcom.sun.management.jmxremote \
-Dcom.sun.management.jmxremote.port=11809 \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.ssl=false \
-Dconfig=/config/location/ \
com.my.package.cacheServer
When run without -XX:+ScavengeBeforeFullGC -XX:+CMSScavengeBeforeRemark -XX:+DisableExplicitGC, we used to get the following error while the indexes were being created:
org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests
We tried increasing the member-timeout property from 5000 to 300000, but the same issue persisted.
After adding the GC-related Java parameters above, each index takes around 24 minutes to create, but this time without errors. As a result, the server takes far too long to come up, given that it also hosts around 15 other regions. The other regions do not show this issue. (The region in question has the largest entry count; the other regions have between 500K and 3M entries.)
There are a few things I see from your configuration that need to be adjusted. For some of this I will need to speculate, as I do not know your general tenured heap consumption.
Set NewSize and MaxNewSize to 9 GB.
Set SurvivorRatio to 1.
Set TargetSurvivorRatio to 85.
Add the PrintTenuringDistribution flag to help us fine-tune.
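As a rough sketch, the young-generation suggestions above could be written as JVM options like the following. The variable name GC_SIZING_OPTS is hypothetical, the values are only the starting points named above (not tuned numbers), and the flags assume a HotSpot JVM where CMS and ParNew are still available, such as Java 8.

```shell
# Starting-point young-generation settings; values are the ones suggested above, not tuned.
# GC_SIZING_OPTS is a hypothetical variable name used only for this illustration.
GC_SIZING_OPTS="-XX:NewSize=9g -XX:MaxNewSize=9g"
GC_SIZING_OPTS="$GC_SIZING_OPTS -XX:SurvivorRatio=1 -XX:TargetSurvivorRatio=85"
# Prints the age distribution of surviving objects at each young collection.
GC_SIZING_OPTS="$GC_SIZING_OPTS -XX:+PrintTenuringDistribution"
echo "$GC_SIZING_OPTS"
```

These options would sit next to -Xms and -Xmx on the existing java command line, for example as java -Xms50G -Xmx80G $GC_SIZING_OPTS followed by the remaining arguments.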
I am not a fan of the Scavenge flags, as they cause even more thrashing when not finely tuned. You can keep them in for now, but my recommendation is to remove -XX:+ScavengeBeforeFullGC and -XX:+CMSScavengeBeforeRemark. Keep the DisableExplicitGC flag. More importantly, while I read that your behavior changes depending on these flags, drawing a correlation between index creation time and these flags is a stretch. What is more likely is that members are becoming unresponsive due to a bad heap configuration, so let's solve that.
With respect to your eviction configuration, you say that you have 7+ million entries in this "problem" region, and yet you have an eviction algorithm that overflows to disk all but the first 1000 entries. Why? Overflow to disk is something to use to handle bursts of activity, not as a "given". Perhaps disk issues are driving some aspects of your problem. Perhaps needing to access all of these entries on disk is the problem. Have you experienced this issue when all entries are actually in the heap?
Enable GC logging, with the flags set to print GC details, date stamps, and so on.
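A minimal sketch of GC logging options, assuming a Java 8 HotSpot JVM (later JDKs replace these flags with unified logging via -Xlog). The variable name GC_LOG_OPTS and the log path /var/log/gemfire/gc.log are hypothetical placeholders, not values from your setup.

```shell
# Java 8 style GC logging flags; the log file path is a hypothetical placeholder.
GC_LOG_OPTS="-Xloggc:/var/log/gemfire/gc.log"
# Per-collection details, with wall-clock dates and JVM-relative timestamps.
GC_LOG_OPTS="$GC_LOG_OPTS -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps"
# Total time application threads were stopped, useful when chasing missed heartbeats.
GC_LOG_OPTS="$GC_LOG_OPTS -XX:+PrintGCApplicationStoppedTime"
echo "$GC_LOG_OPTS"
```

Long stopped-time entries that line up with the ForcedDisconnectException timestamps would point at GC pauses rather than index creation itself.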
If you do not yet have statistics enabled for GemFire, please enable those as well.
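One way to turn on the GemFire statistics archive, sketched here as system properties with the gemfire. prefix. The variable name GEMFIRE_STATS_OPTS and the archive path are hypothetical. Note that the statistics="true" attribute on the region is a separate, per-region setting and does not by itself produce a statistics archive file.

```shell
# GemFire statistic sampling plus an archive file; the path is a hypothetical placeholder.
GEMFIRE_STATS_OPTS="-Dgemfire.statistic-sampling-enabled=true"
GEMFIRE_STATS_OPTS="$GEMFIRE_STATS_OPTS -Dgemfire.statistic-archive-file=/var/log/gemfire/stats.gfs"
echo "$GEMFIRE_STATS_OPTS"
```

The same two properties (statistic-sampling-enabled and statistic-archive-file, without the gemfire. prefix) could presumably go into the gemfireProperties bean that your cache definition already references instead of the command line.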
If you are finding that the member-timeout is insufficient, it is likely that you have issues in your environment. Those should be addressed directly, rather than increasing the member-timeout to cover them up.