I load a text file into an RDD like so:

val rdd = sc.textFile(<file_name>)

I try to repartition the RDD in order to speed up processing:

rdd.repartition(<n>)

No matter what I pass for <n>, the number of partitions does not seem to change:

rdd.getNumPartitions()

always prints the same number (3). How do I change the number of partitions to improve performance?
That's because RDDs are immutable. You cannot change the partitions of an RDD, but you can create a new one with the desired number of partitions.
scala> val a = sc.parallelize(1 to 1000)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:21
scala> a.partitions.size
res2: Int = 4
scala> val b = a.repartition(6)
b: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[4] at repartition at <console>:23
scala> a.partitions.size
res3: Int = 4
scala> b.partitions.size
res4: Int = 6
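Applied to the question's own example, the fix is to capture the RDD that repartition returns and use that going forward (the file name, partition count, and val names below are placeholders, and this assumes a live SparkContext sc):

```scala
// Original RDD; getNumPartitions reflects how the file was split (e.g. 3)
val rdd = sc.textFile(<file_name>)

// repartition does NOT mutate rdd: it returns a NEW RDD with <n> partitions
val repartitioned = rdd.repartition(<n>)

repartitioned.getNumPartitions()  // <n>
rdd.getNumPartitions()            // still the original count
```

Note that repartition performs a full shuffle; if you only want to reduce the number of partitions, coalesce(<n>) avoids the shuffle and is usually cheaper.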