apache-spark, distributed-computing, partitioning

How to decide the number of partitions in Spark (running on YARN) based on executors, cores, and memory


How do I decide the number of partitions in Spark (running on YARN) based on executors, cores, and memory? I am new to Spark, so I don't have much hands-on experience with real scenarios.

I know there are many things to consider when deciding the number of partitions, but a detailed explanation of a typical production scenario would still be very helpful.

Thanks in advance


Solution

  • One important parameter for parallel collections is the number of partitions to cut the dataset into. Spark will run one task for each partition of the cluster. Typically you want 2-4 partitions for each CPU in your cluster.

    In other words, the recommended number of partitions is 2 to 4 times the total number of cores.

    So if you have 7 executors with 5 cores each, you can repartition to somewhere between 7 * 5 * 2 = 70 and 7 * 5 * 4 = 140 partitions (see the sketch after the link below).

    https://spark.apache.org/docs/latest/rdd-programming-guide.html
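
Here is a minimal Scala sketch of that arithmetic, assuming the 7-executor / 5-core cluster from the example above; the input path and app name are hypothetical placeholders, not values from the question:

    import org.apache.spark.sql.SparkSession

    object PartitionSizing {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("partition-sizing") // hypothetical app name
          .getOrCreate()

        // Cluster shape from the example: 7 executors x 5 cores = 35 total cores.
        val numExecutors = 7
        val coresPerExec = 5
        val totalCores   = numExecutors * coresPerExec

        // Rule of thumb from the docs: 2-4 partitions per core.
        val minPartitions = totalCores * 2 // 70
        val maxPartitions = totalCores * 4 // 140

        // Hypothetical input path for illustration only.
        val rdd = spark.sparkContext.textFile("hdfs:///path/to/input")

        // Pick a value in the 2x-4x range; 3x is used here as a middle ground.
        val repartitioned = rdd.repartition(totalCores * 3) // 105 partitions

        println(s"partitions: ${repartitioned.getNumPartitions}")
        spark.stop()
      }
    }

The exact multiplier within the 2x-4x range is a judgment call: more partitions give better load balancing and smaller tasks, while fewer partitions reduce scheduling overhead.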