I'm working on a Spark application and I have a serious issue: the GC task time is very high. I exported the logs and analysed them with GCeasy.
Cluster Hardware:
1 driver: m4.2xlarge, 16 vCores, 32 GiB memory, EBS-only storage (32 GiB)
15 core nodes: m4.2xlarge, 16 vCores, 32 GiB memory, EBS-only storage (32 GiB)
Configuration
hadoop-env.export JAVA_HOME /usr/lib/jvm/java-1.8.0
mapred-site mapreduce.fileoutputcommitter.algorithm.version 2
mapred-site mapred.output.committer.class org.apache.hadoop.mapred.FileOutputCommitter
spark-defaults spark.default.parallelism 880
spark-defaults spark.executor.instances 44
spark-defaults spark.yarn.executor.memoryOverhead 3072
spark-defaults spark.executor.cores 10
spark-defaults spark.yarn.driver.memoryOverhead 3072
spark-defaults spark.driver.memory 18G
spark-defaults spark.driver.cores 10
spark-defaults spark.executor.memory 18G
spark-env.export JAVA_HOME /usr/lib/jvm/java-1.8.0
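As a quick sanity check on the settings above: each executor requests an 18 GiB heap plus 3 GiB overhead (21 GiB total) and 10 cores, while each node has only 32 GiB and 16 vCores. A back-of-the-envelope calculation (assuming, optimistically, that YARN can hand out close to the full node memory — on EMR it actually exposes less, so reality is even tighter):

```python
# Back-of-the-envelope check of the spark-defaults above.
# Assumption: YARN can allocate close to the full 32 GiB per node;
# the EMR NodeManager actually exposes less, so reality is tighter.

node_memory_gib = 32
node_vcores = 16
num_core_nodes = 15

executor_heap_gib = 18        # spark.executor.memory
executor_overhead_gib = 3     # spark.yarn.executor.memoryOverhead (3072 MB)
executor_cores = 10           # spark.executor.cores
requested_executors = 44      # spark.executor.instances

footprint_gib = executor_heap_gib + executor_overhead_gib  # 21 GiB per executor

# How many such executors fit on one node, by memory and by cores?
fit_by_memory = node_memory_gib // footprint_gib           # 1
fit_by_cores = node_vcores // executor_cores               # 1
executors_per_node = min(fit_by_memory, fit_by_cores)      # 1

max_executors = executors_per_node * num_core_nodes        # 15
print(f"per-executor footprint: {footprint_gib} GiB")
print(f"executors that fit: {max_executors} (requested: {requested_executors})")
```

So at most 15 of the 44 requested executors can actually be scheduled, and each one gets an 18 GiB heap — exactly the kind of oversized heap that drives long GC pauses.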
Inputs
Size: ~1.2 TB of data.
Pseudo Instructions
1. read data
2. map to pair: row -> Tuple(row, 1)
3. distinct
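A side note on the pipeline itself: since the second element of each tuple is the constant 1, calling distinct on the pairs deduplicates exactly the same rows as calling distinct on the raw data — the constant only inflates the records shuffled. A plain-Python simulation of the two variants (hypothetical sample data, no Spark involved):

```python
# Simulate the pipeline on a small in-memory sample (hypothetical data).
rows = ["a", "b", "a", "c", "b", "a"]

# Variant 1: as in the question -- pair each row with 1, then distinct.
paired = [(row, 1) for row in rows]
distinct_pairs = set(paired)

# Variant 2: distinct on the raw rows, skipping the pairing step.
distinct_rows = set(rows)

# Both variants keep exactly the same set of unique rows.
assert {row for row, _ in distinct_pairs} == distinct_rows
print(sorted(distinct_rows))  # prints ['a', 'b', 'c']
```

On 1.2 TB of input, dropping the constant (or, if per-row counts are actually wanted, replacing the pair-then-distinct steps with reduceByKey) cuts the volume the distinct shuffle has to move.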
Logs Problem
Analysis Link
I'm not an expert in GC collector dynamics; can somebody help me find the problem?
Your Spark executors are large, and large executors introduce heavy GC overhead.
Watch this video for how to choose an executor size and tune performance.
I recommend watching the full video: https://www.youtube.com/watch?v=OkyRdKahMpk
or at least from this point, on executor sizing: https://youtu.be/OkyRdKahMpk?t=1308
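For reference, one commonly cited sizing heuristic (the exact numbers below are my assumption, not taken from the video): leave roughly 1 core per node for the OS and daemons, use about 5 cores per executor, and split the memory YARN actually exposes (around 24 GiB per node on an EMR m4.2xlarge, not the full 32 GiB) evenly. On these 16-vCore nodes that gives 3 executors per node, so something like:

```
spark-defaults spark.executor.instances 44
spark-defaults spark.executor.cores 5
spark-defaults spark.executor.memory 7G
spark-defaults spark.yarn.executor.memoryOverhead 1024
```

That is 15 nodes x 3 executors, minus 1 slot for the application master. Smaller heaps (~7 GiB instead of 18 GiB) give each GC cycle far less to scan, which is the usual fix for the symptom in the question; validate the exact values against the memory YARN reports for your nodes.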