I am new to Spark. I have been reading about Spark configuration and the different properties that can be set to optimize a job, but I am not sure how to figure out what values I should use.
For example, I created a cluster of r3.8xlarge instances (1 master and 10 slaves).
How do I set:
spark.executor.memory
spark.driver.memory
spark.sql.shuffle.partitions
spark.default.parallelism
spark.driver.cores
spark.executor.cores
spark.memory.fraction
spark.executor.instances
Or should I just leave the defaults? But leaving the defaults makes my job very slow. My job has 3 group-bys and 3 broadcasted maps.
Thanks
To tune your application you need to know a few things:
1) You need to monitor your application to see whether your cluster is under-utilized and how many resources your application actually uses.
Monitoring can be done with various tools, e.g. Ganglia. From Ganglia you can see CPU, memory, and network usage.
2) Based on those observations about CPU and memory usage, you can get a better idea of what kind of tuning your application needs.
From Spark's point of view:
In spark-defaults.conf
you can specify what kind of serialization is needed, how much driver memory and executor memory your application needs, and you can even change the garbage collection algorithm.
Below are a few examples; you can tune these parameters based on your requirements:
# Use Kryo instead of Java serialization (usually faster and more compact)
spark.serializer org.apache.spark.serializer.KryoSerializer
# Memory for the driver process
spark.driver.memory 5g
# Memory per executor
spark.executor.memory 3g
# JVM options: G1 garbage collector (-XX:MaxPermSize only applies to Java 7 and earlier)
spark.executor.extraJavaOptions -XX:MaxPermSize=2G -XX:+UseG1GC
spark.driver.extraJavaOptions -XX:MaxPermSize=6G -XX:+UseG1GC
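If you prefer to set these per application rather than cluster-wide, the same properties can be passed through the SparkSession builder. Below is a minimal Scala sketch; the property names are standard Spark settings, but the values are illustrative assumptions only, not recommendations for your cluster.

import org.apache.spark.sql.SparkSession

// Illustrative values only; choose them based on what monitoring shows
// for your own cluster and workload.
// Note: spark.driver.memory cannot be changed here because the driver JVM
// is already running; set it in spark-defaults.conf or via spark-submit.
val spark = SparkSession.builder()
  .appName("TuningExample")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.executor.memory", "3g")
  .config("spark.executor.cores", "4")
  .config("spark.executor.instances", "10")
  .config("spark.sql.shuffle.partitions", "200")  // partitions used by group-by/join shuffles
  .config("spark.default.parallelism", "80")      // parallelism for RDD operations
  .getOrCreate()

The same properties can also be passed as --conf flags to spark-submit, which keeps tuning values out of the application code.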
For more details, refer to http://spark.apache.org/docs/latest/tuning.html
Hope this helps!