apache-spark · pyspark · partition · hadoop-partitioning

Why is `getNumPartitions()` not giving me the correct number of partitions specified by `repartition`?


I have a text file in an RDD, like so: `sc.textFile(<file_name>)`.

I try to repartition the RDD in order to speed up processing:

`sc.repartition(<n>)`.

No matter what I put in for `<n>`, the number of partitions does not seem to change:

`RDD.getNumPartitions()` always returns the same number (3).

How do I change the number of partitions to increase performance?


Solution

  • That's because RDDs are immutable. You cannot change the partitioning of an existing RDD; instead, `repartition` returns a new RDD with the desired number of partitions, which you need to assign to a variable:

    scala> val a = sc.parallelize( 1 to 1000)
    a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:21
    scala> a.partitions.size
    res2: Int = 4
    scala> val b = a.repartition(6)
    b: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[4] at repartition at <console>:23
    scala> a.partitions.size
    res3: Int = 4
    scala> b.partitions.size
    res4: Int = 6
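
    The same fix in PySpark, which the question uses. A minimal pyspark-shell sketch, assuming `sc` is predefined by the shell and a hypothetical input file `data.txt`; the initial count of 3 mirrors the question and in practice depends on the input splits:

    >>> rdd = sc.textFile("data.txt")        # hypothetical input file
    >>> rdd.getNumPartitions()
    3
    >>> repartitioned = rdd.repartition(10)  # returns a new RDD
    >>> rdd.getNumPartitions()               # the original RDD is unchanged
    3
    >>> repartitioned.getNumPartitions()     # the new RDD has the requested count
    10

    Note that `repartition` always performs a full shuffle; if you only want to reduce the number of partitions, `coalesce` avoids that cost.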