apache-spark, rdd

Spark creating more partitions in an RDD than the data size


I am a noob and learning PySpark now. My question about RDDs is: what happens when we try to create more partitions than there are elements in the data? E.g.,

data = sc.parallelize(range(5), numSlices=8)

I understand that the intention of partitions is to use the CPU cores of a cluster effectively, and that making partitions too small adds scheduling overhead that outweighs the benefit of distributed computing. What I am curious about is: does Spark still create 8 partitions here, or does it optimize them to the number of cores? And if it does create 8 partitions, is the data replicated in each partition?


Solution

  • My question about RDDs is: what happens when we try to create more partitions than there are elements in the data?

    You can easily see how many partitions a given RDD has by using data.getNumPartitions. I tried creating the RDD you mentioned and running this command, and it shows me there are 8 partitions. Four partitions had one number each and the other four were empty.
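
    For the PySpark setup in the question, a quick way to check this yourself is to look at the partition contents directly. This is only a sketch: it assumes an existing SparkContext named sc and uses glom(), which the answer above does not use.

    data = sc.parallelize(range(5), numSlices=8)
    print(data.getNumPartitions())  # 8 - Spark keeps the requested number of partitions
    print(data.glom().collect())    # one list per partition; some lists are empty, and no element appears twice

    The requested partition count is kept as-is; the extra partitions simply come out empty.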

    If it creates 8 partitions, is the data replicated in each partition?

    You can try the following code and check the executor output to see how many records there are in each partition. Note the print statement in the code below; I have to return something, as required by the API, so I return each element multiplied by 2 (a PySpark version of the same check is sketched after the output).

    data.mapPartitionsWithIndex((x, y) => {
      val elems = y.toList // materialize the iterator so counting it does not consume it
      println(s"partitions $x has ${elems.length} records"); elems.map(_ * 2).iterator
    }).collect.foreach(println)
    

    I got the following output from the print statements:

    partitions 0 has 0 records
    partitions 1 has 1 records
    partitions 2 has 0 records
    partitions 3 has 1 records
    partitions 4 has 0 records
    partitions 5 has 1 records
    partitions 6 has 0 records
    partitions 7 has 1 records
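
    Since the question is about PySpark, here is a rough Python equivalent of the same per-partition check. It is a sketch that assumes the same data RDD; the function name is just illustrative.

    def count_and_double(index, iterator):
        # materialize the iterator so it can be counted and then re-emitted
        elems = list(iterator)
        print(f"partition {index} has {len(elems)} records")
        return (x * 2 for x in elems)
    data.mapPartitionsWithIndex(count_and_double).collect()

    In local mode the print output appears in the console; on a cluster it ends up in the executor logs, which is why you need to check the executor output.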
    

    What I am curious about is: does Spark still create 8 partitions here, or does it optimize them to the number of cores?

    The number of partitions defines how much data you want Spark to process in one task. If there are 8 partitions and 4 virtual cores, Spark starts by running 4 tasks (corresponding to 4 partitions) at once. As those tasks finish, it schedules the remaining tasks on those cores.
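
    As a rough illustration of that scheduling behaviour, here is a sketch for a local run; the master string and app name are arbitrary choices, not something from the question.

    from pyspark import SparkContext
    # "local[4]" runs Spark with 4 worker threads, i.e. 4 virtual cores
    sc = SparkContext(master="local[4]", appName="partition-demo")
    data = sc.parallelize(range(5), numSlices=8)
    # count() launches one stage with 8 tasks, one per partition;
    # at most 4 run at a time and the rest wait until a core frees up
    data.count()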