apache-spark, distributed-computing, partitioning

How to decide the number of partitions in Spark (running on YARN) based on executors, cores, and memory


How do I decide the number of partitions in Spark (running on YARN) based on executors, cores, and memory? I am new to Spark, so I don't have much hands-on experience with real scenarios.

I know there are many things to consider when deciding the number of partitions, but a detailed explanation of a typical production scenario would still be very helpful.

Thanks in advance


Solution

  • One important parameter for parallel collections is the number of partitions to cut the dataset into. Spark will run one task for each partition of the cluster. Typically you want 2-4 partitions for each CPU in your cluster.

    In other words, the recommended number of partitions is 2 to 4 times the total number of cores.

    So if you have 7 executors with 5 cores each, you can repartition to somewhere between 7 * 5 * 2 = 70 and 7 * 5 * 4 = 140 partitions (see the sketch after the link below).

    https://spark.apache.org/docs/latest/rdd-programming-guide.html
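
Here is a minimal Scala sketch of that arithmetic, assuming the 7-executor / 5-core cluster from the example above; the input path and app name are hypothetical placeholders, not values from the question:

    import org.apache.spark.sql.SparkSession

    object PartitionSizing {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("partition-sizing") // hypothetical app name
          .getOrCreate()

        // Cluster shape from the example: 7 executors x 5 cores = 35 total cores.
        val numExecutors = 7
        val coresPerExec = 5
        val totalCores   = numExecutors * coresPerExec

        // Rule of thumb from the docs: 2-4 partitions per core.
        val minPartitions = totalCores * 2 // 70
        val maxPartitions = totalCores * 4 // 140

        // Hypothetical input path for illustration only.
        val rdd = spark.sparkContext.textFile("hdfs:///path/to/input")

        // Pick a value in the 2x-4x range; 3x is used here as a middle ground.
        val repartitioned = rdd.repartition(totalCores * 3) // 105 partitions

        println(s"partitions: ${repartitioned.getNumPartitions}")
        spark.stop()
      }
    }

The exact multiplier within the 2x-4x range is a judgment call: more partitions give better load balancing and smaller tasks, while fewer partitions reduce scheduling overhead.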