apache-spark · pyspark · partition · hadoop-partitioning

Why is `getNumPartitions()` not giving me the correct number of partitions specified by `repartition`?


I have a text file in an RDD, like so: `sc.textFile(<file_name>)`.

I try to repartition the RDD in order to speed up processing:

`sc.repartition(<n>)`.

No matter what I put in for `<n>`, the number of partitions does not seem to change:

`RDD.getNumPartitions()` always returns the same number (3).

How do I change the number of partitions to increase performance?


Solution

  • That's because RDDs are immutable. You cannot change the partitioning of an existing RDD; instead, `repartition` returns a new RDD with the desired number of partitions, which you need to assign to a variable:

    scala> val a = sc.parallelize( 1 to 1000)
    a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:21
    scala> a.partitions.size
    res2: Int = 4
    scala> val b = a.repartition(6)
    b: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[4] at repartition at <console>:23
    scala> a.partitions.size
    res3: Int = 4
    scala> b.partitions.size
    res4: Int = 6
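
    The same fix in PySpark, which the question uses. A minimal pyspark-shell sketch, assuming `sc` is predefined by the shell and a hypothetical input file `data.txt`; the initial count of 3 mirrors the question and in practice depends on the input splits:

    >>> rdd = sc.textFile("data.txt")        # hypothetical input file
    >>> rdd.getNumPartitions()
    3
    >>> repartitioned = rdd.repartition(10)  # returns a new RDD
    >>> rdd.getNumPartitions()               # the original RDD is unchanged
    3
    >>> repartitioned.getNumPartitions()     # the new RDD has the requested count
    10

    Note that `repartition` always performs a full shuffle; if you only want to reduce the number of partitions, `coalesce` avoids that cost.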