hadoop, apache-spark, mapreduce, hadoop-yarn, emr

How to set configurations to make a Spark/YARN job faster?


I am new to Spark. I have been reading about Spark config and the different properties to set so that we can optimize the job. But I am not sure how to figure out what I should set.

For example, I created a cluster of type r3.8xlarge (1 master and 10 slaves).

How do I set:

spark.executor.memory           
spark.driver.memory             
spark.sql.shuffle.partitions
spark.default.parallelism
spark.driver.cores              
spark.executor.cores             
spark.memory.fraction            
spark.executor.instances

Or should I just leave the defaults? But leaving the defaults makes my job very slow. My job has 3 group-bys and 3 broadcasted maps.
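
For context, here is a minimal sketch of where these would go if set programmatically (Scala; the values are placeholders, not what I intend to use):

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder values; figuring out the right ones is exactly the question.
val conf = new SparkConf()
  .setAppName("my-job")
  .set("spark.executor.memory", "3g")
  .set("spark.driver.memory", "5g")
  .set("spark.sql.shuffle.partitions", "200")
  .set("spark.default.parallelism", "200")
  .set("spark.executor.cores", "2")
  .set("spark.executor.instances", "10")
// Note: spark.driver.memory set here only takes effect in cluster mode;
// in client mode the driver JVM has already started, so it must be passed
// via spark-submit or spark-defaults.conf instead.
val sc = new SparkContext(conf)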

Thanks


Solution

  • To tune your application, you need to know a few things:

    1) You need to monitor your application to see whether your cluster is underutilized, and how much of its resources the application you have created actually uses.

    Monitoring can be done with various tools, e.g. Ganglia. From Ganglia you can see CPU, memory, and network usage.
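
    As a complement to Ganglia, Spark itself exposes per-executor metrics through its monitoring REST API. A minimal sketch (assuming the driver UI is reachable on its default port 4040; the host and application id below are placeholders):

    import scala.io.Source

    // Fetch per-executor metrics (memory used, completed tasks, shuffle
    // read/write, ...) from Spark's monitoring REST API. List the real
    // application ids at http://driver-host:4040/api/v1/applications
    val url = "http://driver-host:4040/api/v1/applications/app-20160101000000-0000/executors"
    println(Source.fromURL(url).mkString)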

    2) Based on those observations about CPU and memory usage, you can get a better idea of what kind of tuning your application needs.

    From Spark's point of view:

    In spark-defaults.conf

    you can specify which serializer to use, how much driver memory and executor memory your application needs, and even which garbage-collection algorithm to run.

    Below are a few examples; you can tune these parameters based on your requirements:

    # Kryo is generally faster and more compact than Java serialization
    spark.serializer                 org.apache.spark.serializer.KryoSerializer
    spark.driver.memory              5g
    spark.executor.memory            3g
    # MaxPermSize applies to Java 7 and earlier; JDK 8+ ignores it
    spark.executor.extraJavaOptions  -XX:MaxPermSize=2G -XX:+UseG1GC
    spark.driver.extraJavaOptions    -XX:MaxPermSize=6G -XX:+UseG1GC
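
    If you enable Kryo as above, the tuning guide also recommends registering the classes your job serializes. A minimal sketch (MyKeyClass and MyValueClass are hypothetical placeholders for your application's own types):

    import org.apache.spark.SparkConf

    // Hypothetical application classes; replace with the types that
    // actually flow through your shuffles and broadcast variables.
    case class MyKeyClass(id: Long)
    case class MyValueClass(payload: String)

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registration lets Kryo write a compact class id instead of the
      // full class name with every serialized object.
      .registerKryoClasses(Array(classOf[MyKeyClass], classOf[MyValueClass]))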
    

    For more details, refer to http://spark.apache.org/docs/latest/tuning.html
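
    Since the question mentions r3.8xlarge (32 vCPUs, 244 GiB RAM per node), here is one common sizing heuristic on YARN, offered as a starting point rather than a definitive answer: keep about 5 cores per executor and leave one core plus a few GiB per node for the OS and Hadoop daemons. That gives (32 - 1) / 5 ≈ 6 executors per node and roughly (244 - 8) / 6 ≈ 39 GiB each, of which ~10% should be reserved for YARN's executor memory overhead:

    # 6 executors per node x 10 slaves (leave one slot for the
    # YARN application master if needed)
    spark.executor.instances         60
    spark.executor.cores             5
    # ~39 GiB per executor minus ~10% for spark.yarn.executor.memoryOverhead
    spark.executor.memory            35g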

    Hope this helps!