hadoop, apache-spark, mapreduce, hadoop-yarn, emr

How to set configurations to make a Spark/YARN job faster?


I am new to Spark. I have been reading about Spark config and the different properties to set so that we can optimize the job. But I am not sure how to figure out what I should set.

For example, I created a cluster of type r3.8xlarge (1 master and 10 slaves).

How do I set:

spark.executor.memory           
spark.driver.memory             
spark.sql.shuffle.partitions
spark.default.parallelism
spark.driver.cores              
spark.executor.cores             
spark.memory.fraction            
spark.executor.instances

Or should I just leave the defaults? But leaving the defaults makes my job very slow. My job has 3 group-bys and 3 broadcasted maps.
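
For context, here is a minimal sketch of where these would go if set programmatically (Scala; the values are placeholders, not what I intend to use):

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder values; figuring out the right ones is exactly the question.
val conf = new SparkConf()
  .setAppName("my-job")
  .set("spark.executor.memory", "3g")
  .set("spark.driver.memory", "5g")
  .set("spark.sql.shuffle.partitions", "200")
  .set("spark.default.parallelism", "200")
  .set("spark.executor.cores", "2")
  .set("spark.executor.instances", "10")
// Note: spark.driver.memory set here only takes effect in cluster mode;
// in client mode the driver JVM has already started, so it must be passed
// via spark-submit or spark-defaults.conf instead.
val sc = new SparkContext(conf)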

Thanks


Solution

  • To tune your application, you need to know a few things:

    1) You need to monitor your application to see whether your cluster is underutilized, and how much of its resources the application you have created actually uses.

    Monitoring can be done with various tools, e.g. Ganglia. From Ganglia you can see CPU, memory, and network usage.
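
    As a complement to Ganglia, Spark itself exposes per-executor metrics through its monitoring REST API. A minimal sketch (assuming the driver UI is reachable on its default port 4040; the host and application id below are placeholders):

    import scala.io.Source

    // Fetch per-executor metrics (memory used, completed tasks, shuffle
    // read/write, ...) from Spark's monitoring REST API. List the real
    // application ids at http://driver-host:4040/api/v1/applications
    val url = "http://driver-host:4040/api/v1/applications/app-20160101000000-0000/executors"
    println(Source.fromURL(url).mkString)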

    2) Based on those observations about CPU and memory usage, you can get a better idea of what kind of tuning your application needs.

    From Spark's point of view:

    In spark-defaults.conf

    you can specify which serializer to use, how much driver memory and executor memory your application needs, and even which garbage-collection algorithm to run.

    Below are a few examples; you can tune these parameters based on your requirements:

    # Kryo is generally faster and more compact than Java serialization
    spark.serializer                 org.apache.spark.serializer.KryoSerializer
    spark.driver.memory              5g
    spark.executor.memory            3g
    # MaxPermSize applies to Java 7 and earlier; JDK 8+ ignores it
    spark.executor.extraJavaOptions  -XX:MaxPermSize=2G -XX:+UseG1GC
    spark.driver.extraJavaOptions    -XX:MaxPermSize=6G -XX:+UseG1GC
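
    If you enable Kryo as above, the tuning guide also recommends registering the classes your job serializes. A minimal sketch (MyKeyClass and MyValueClass are hypothetical placeholders for your application's own types):

    import org.apache.spark.SparkConf

    // Hypothetical application classes; replace with the types that
    // actually flow through your shuffles and broadcast variables.
    case class MyKeyClass(id: Long)
    case class MyValueClass(payload: String)

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registration lets Kryo write a compact class id instead of the
      // full class name with every serialized object.
      .registerKryoClasses(Array(classOf[MyKeyClass], classOf[MyValueClass]))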
    

    For more details, refer to http://spark.apache.org/docs/latest/tuning.html
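
    Since the question mentions r3.8xlarge (32 vCPUs, 244 GiB RAM per node), here is one common sizing heuristic on YARN, offered as a starting point rather than a definitive answer: keep about 5 cores per executor and leave one core plus a few GiB per node for the OS and Hadoop daemons. That gives (32 - 1) / 5 ≈ 6 executors per node and roughly (244 - 8) / 6 ≈ 39 GiB each, of which ~10% should be reserved for YARN's executor memory overhead:

    # 6 executors per node x 10 slaves (leave one slot for the
    # YARN application master if needed)
    spark.executor.instances         60
    spark.executor.cores             5
    # ~39 GiB per executor minus ~10% for spark.yarn.executor.memoryOverhead
    spark.executor.memory            35g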

    Hope this helps!