I'm working on a Spark application and I have a serious issue: the GC task time is very high. I exported the logs and analysed them with GCeasy.
Cluster Hardware:
1 driver: m4.2xlarge, 16 vCores, 32 GiB memory, EBS-only storage (32 GiB)
15 core nodes: m4.2xlarge, 16 vCores, 32 GiB memory, EBS-only storage (32 GiB)
Configuration
hadoop-env.export JAVA_HOME /usr/lib/jvm/java-1.8.0
mapred-site mapreduce.fileoutputcommitter.algorithm.version 2
mapred-site mapred.output.committer.class org.apache.hadoop.mapred.FileOutputCommitter
spark-defaults spark.default.parallelism 880
spark-defaults spark.executor.instances 44
spark-defaults spark.yarn.executor.memoryOverhead 3072
spark-defaults spark.executor.cores 10
spark-defaults spark.yarn.driver.memoryOverhead 3072
spark-defaults spark.driver.memory 18G
spark-defaults spark.driver.cores 10
spark-defaults spark.executor.memory 18G
spark-env.export JAVA_HOME /usr/lib/jvm/java-1.8.0
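As a quick sanity check on the settings above: each executor requests an 18 GiB heap plus 3 GiB overhead (21 GiB total) and 10 cores, while each node has only 32 GiB and 16 vCores. A back-of-the-envelope calculation (assuming, optimistically, that YARN can hand out close to the full node memory — on EMR it actually exposes less, so reality is even tighter):

```python
# Back-of-the-envelope check of the spark-defaults above.
# Assumption: YARN can allocate close to the full 32 GiB per node;
# the EMR NodeManager actually exposes less, so reality is tighter.

node_memory_gib = 32
node_vcores = 16
num_core_nodes = 15

executor_heap_gib = 18        # spark.executor.memory
executor_overhead_gib = 3     # spark.yarn.executor.memoryOverhead (3072 MB)
executor_cores = 10           # spark.executor.cores
requested_executors = 44      # spark.executor.instances

footprint_gib = executor_heap_gib + executor_overhead_gib  # 21 GiB per executor

# How many such executors fit on one node, by memory and by cores?
fit_by_memory = node_memory_gib // footprint_gib           # 1
fit_by_cores = node_vcores // executor_cores               # 1
executors_per_node = min(fit_by_memory, fit_by_cores)      # 1

max_executors = executors_per_node * num_core_nodes        # 15
print(f"per-executor footprint: {footprint_gib} GiB")
print(f"executors that fit: {max_executors} (requested: {requested_executors})")
```

So at most 15 of the 44 requested executors can actually be scheduled, and each one gets an 18 GiB heap — exactly the kind of oversized heap that drives long GC pauses.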
Inputs
Size: ~1.2 TB of data.
Pseudo Instructions
1. read data
2. map to pair: row -> Tuple(row, 1)
3. distinct
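A side note on the pipeline itself: since the second element of each tuple is the constant 1, calling distinct on the pairs deduplicates exactly the same rows as calling distinct on the raw data — the constant only inflates the records shuffled. A plain-Python simulation of the two variants (hypothetical sample data, no Spark involved):

```python
# Simulate the pipeline on a small in-memory sample (hypothetical data).
rows = ["a", "b", "a", "c", "b", "a"]

# Variant 1: as in the question -- pair each row with 1, then distinct.
paired = [(row, 1) for row in rows]
distinct_pairs = set(paired)

# Variant 2: distinct on the raw rows, skipping the pairing step.
distinct_rows = set(rows)

# Both variants keep exactly the same set of unique rows.
assert {row for row, _ in distinct_pairs} == distinct_rows
print(sorted(distinct_rows))  # prints ['a', 'b', 'c']
```

On 1.2 TB of input, dropping the constant (or, if per-row counts are actually wanted, replacing the pair-then-distinct steps with reduceByKey) cuts the volume the distinct shuffle has to move.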
Logs Problem
Analysis Link
I'm not an expert in GC collector dynamics; can somebody help me find the problem?
Your Spark executors are large, and large executors introduce heavy GC overhead.
Watch this video for how to choose an executor size and tune performance.
I recommend watching the full video: https://www.youtube.com/watch?v=OkyRdKahMpk
or at least from this point, on executor sizing: https://youtu.be/OkyRdKahMpk?t=1308
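For reference, one commonly cited sizing heuristic (the exact numbers below are my assumption, not taken from the video): leave roughly 1 core per node for the OS and daemons, use about 5 cores per executor, and split the memory YARN actually exposes (around 24 GiB per node on an EMR m4.2xlarge, not the full 32 GiB) evenly. On these 16-vCore nodes that gives 3 executors per node, so something like:

```
spark-defaults spark.executor.instances 44
spark-defaults spark.executor.cores 5
spark-defaults spark.executor.memory 7G
spark-defaults spark.yarn.executor.memoryOverhead 1024
```

That is 15 nodes x 3 executors, minus 1 slot for the application master. Smaller heaps (~7 GiB instead of 18 GiB) give each GC cycle far less to scan, which is the usual fix for the symptom in the question; validate the exact values against the memory YARN reports for your nodes.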