Tags: java, apache-spark, partition

Spark Custom Partitioning in Java


I want to write a custom partitioner in Spark, and I'm working in Java.

However, I've noticed that the JavaRDD class (and the Dataset class) doesn't have a partitionBy(Partitioner) method like in Scala. Only JavaPairRDD does. How am I supposed to partition RDDs or Datasets without this method?


Solution

  • How am I supposed to partition RDDs or Datasets without this method?

    You're not supposed to:

    • Datasets have no public concept of a Partitioner. Instead, you use the repartition method, which takes a number of partitions and an optional list of Columns. The partitioning method itself is not configurable: it always uses hash partitioning based on Murmur3 hash (see the first sketch after this list).

    • RDDs other than pair RDDs (JavaPairRDD in Java, RDD[(_, _)] in Scala) cannot be partitioned with a custom Partitioner at all. If you want to custom-partition another RDD, you have to convert it to a pair RDD first. If there is no natural key-value split, you can use the record itself as the key and null as the value (see the second sketch after this list).
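
    For the Dataset case, here is a minimal sketch of repartition in Java. The file name users.json and the userId column are placeholders; any Dataset<Row> works the same way:

    ```java
    import static org.apache.spark.sql.functions.col;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class RepartitionExample {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("repartition-example")
                    .master("local[*]")
                    .getOrCreate();

            // Placeholder input; any Dataset<Row> with a "userId" column behaves the same.
            Dataset<Row> df = spark.read().json("users.json");

            // 8 partitions, rows routed by hashing the "userId" column.
            // The hash function (Murmur3) is fixed; only the columns and
            // the partition count are under your control.
            Dataset<Row> byUser = df.repartition(8, col("userId"));

            System.out.println(byUser.rdd().getNumPartitions()); // prints 8

            spark.stop();
        }
    }
    ```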
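
    For the RDD case, here is a sketch of the record-as-key, null-as-value workaround with a toy custom Partitioner. The parity-by-length scheme is purely illustrative, not a recommended partitioning strategy:

    ```java
    import java.util.Arrays;

    import org.apache.spark.Partitioner;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    public class CustomPartitionerExample {

        // Toy custom Partitioner: routes records to one of two partitions
        // based on whether the string's length is even or odd.
        static class LengthParityPartitioner extends Partitioner {
            @Override
            public int numPartitions() {
                return 2;
            }

            @Override
            public int getPartition(Object key) {
                return ((String) key).length() % 2;
            }
        }

        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext("local[*]", "custom-partitioner");

            JavaRDD<String> words = sc.parallelize(Arrays.asList("spark", "java", "partition"));

            // JavaRDD has no partitionBy, so lift each record into a pair:
            // the record becomes the key, null the value.
            JavaPairRDD<String, Void> pairs =
                    words.mapToPair(w -> new Tuple2<>(w, (Void) null));

            // Now partitionBy(Partitioner) is available.
            JavaPairRDD<String, Void> partitioned =
                    pairs.partitionBy(new LengthParityPartitioner());

            // Drop the dummy values to get back a plain JavaRDD.
            JavaRDD<String> result = partitioned.keys();

            System.out.println(result.getNumPartitions()); // prints 2

            sc.stop();
        }
    }
    ```

    Note that keys() after partitionBy returns a plain JavaRDD again, so the placement produced by the custom Partitioner is preserved only until the next shuffle.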