hadoop-yarn google-cloud-dataproc

Why does Dataproc use GC_OPTS="-XX:+UseConcMarkSweepGC" for YARN?


While working with Dataproc, I was exploring the configuration options related to Spark and YARN, and I found that Dataproc includes GC_OPTS="-XX:+UseConcMarkSweepGC" in the YARN environment configuration:

GC_OPTS="-XX:+UseConcMarkSweepGC"
# Log GC details to stdout, these will be in diagnostic tarballs.
GC_LOGGING_OPTS="-XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCDetails"
export YARN_TIMELINESERVER_OPTS="${GC_OPTS} ${GC_LOGGING_OPTS} ${YARN_TIMELINESERVER_OPTS}"

Is there a specific YARN performance requirement that calls for the CMS collector instead of the default garbage collector?


Solution

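  • In some cases with very high memory usage, stop-the-world garbage collection pauses can last long enough to trigger timeouts in daemons talking to the ResourceManager or NameNode. This was observed in some Dataproc clusters before they were reconfigured to use CMS GC. CMS does most of its old-generation collection concurrently with the application threads, which keeps those pauses shorter.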

    The optimal options vary with the characteristics of the workload, but this approach is consistent with general Hadoop guidance, such as https://community.hortonworks.com/articles/14170/namenode-garbage-collection-configuration-best-pra.html
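
    As a rough illustration, the same pattern could be applied to the ResourceManager in yarn-env.sh. This is only a sketch, not Dataproc's actual configuration: the GC_TUNING_OPTS variable and the 70% occupancy threshold are assumptions made for the example, and the flags apply to JDK 8 style JVMs (CMS was removed in JDK 14).

```shell
# Collector choice taken from the question: CMS collects the old generation
# mostly concurrently, so stop-the-world pauses stay short.
GC_OPTS="-XX:+UseConcMarkSweepGC"

# ASSUMPTION (not from the Dataproc config): start CMS cycles once the old
# generation is 70% full instead of letting the JVM choose the threshold.
GC_TUNING_OPTS="-XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly"

# Same GC logging flags as in the question.
GC_LOGGING_OPTS="-XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCDetails"

# Prepend the GC flags and keep whatever was already set for the daemon.
export YARN_RESOURCEMANAGER_OPTS="${GC_OPTS} ${GC_TUNING_OPTS} ${GC_LOGGING_OPTS} ${YARN_RESOURCEMANAGER_OPTS}"

# Print the final option string so it can be checked before restarting the daemon.
echo "${YARN_RESOURCEMANAGER_OPTS}"
```

    Whether the extra tuning flags help depends on heap size and workload, so it is worth comparing pause times in the GC logs before and after any change.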