I want to write a custom partitioner in Spark, and I'm working in Java.
However, I've noticed that the `JavaRDD` class (and the `Dataset` class) doesn't have a `partitionBy(Partitioner)` method like the Scala API does; only `JavaPairRDD` does. How am I supposed to partition RDDs or Datasets without this method?
> How am I supposed to partition RDDs or Datasets without this method?
You are not supposed to:

`Dataset`s have no public concept of a `Partitioner`. Instead you use the `repartition` method, which takes a number of partitions and an optional list of `Column`s. The partitioning method itself is not configurable (it uses hash partitioning with Murmur3 hash).
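A minimal sketch of the `Dataset` case, assuming a local `SparkSession` (the class name `RepartitionExample` and the toy `id` column are made up for illustration):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class RepartitionExample {
    // Returns the partition count after hash-repartitioning by "id".
    static int repartitionedCount() {
        SparkSession spark = SparkSession.builder()
                .appName("repartition-example")
                .master("local[*]")   // local mode, just for the sketch
                .getOrCreate();
        try {
            Dataset<Row> df = spark.range(0, 100).toDF("id");
            // Rows are distributed by the Murmur3 hash of the "id" column
            // into 8 partitions; the hash algorithm cannot be swapped out.
            Dataset<Row> byColumn = df.repartition(8, col("id"));
            return byColumn.rdd().getNumPartitions();
        } finally {
            spark.stop();
        }
    }

    public static void main(String[] args) {
        System.out.println(repartitionedCount()); // prints 8
    }
}
```

Note that `repartition(8)` without columns round-robins the rows instead; passing columns is what ties the placement to the row contents.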
`RDD`s other than pair RDDs (`JavaPairRDD` in Java, `RDD[(_, _)]` in Scala) cannot be partitioned with a custom `Partitioner` at all. If you want to re-partition such an `RDD` that way, you have to convert it to a pair RDD first. If you don't have a natural key, you can use `null` as the value and the record itself as the key.
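The workaround above can be sketched as follows. The `LengthPartitioner` here is a hypothetical custom partitioner (it routes strings by length) used only to show the pattern; any `org.apache.spark.Partitioner` subclass would do:

```java
import org.apache.spark.Partitioner;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
import java.util.Arrays;

public class CustomPartitionerExample {
    // Hypothetical custom partitioner: buckets strings by length.
    static class LengthPartitioner extends Partitioner {
        private final int numPartitions;
        LengthPartitioner(int n) { this.numPartitions = n; }
        @Override public int numPartitions() { return numPartitions; }
        @Override public int getPartition(Object key) {
            return ((String) key).length() % numPartitions;
        }
    }

    // Partitions a plain RDD with a custom Partitioner and returns
    // the resulting partition count.
    static int partitionedCount() {
        JavaSparkContext sc =
                new JavaSparkContext("local[*]", "custom-partitioner");
        try {
            JavaRDD<String> rdd =
                    sc.parallelize(Arrays.asList("a", "bb", "ccc", "dd"));

            // Use the record as the key and null as the value, so the
            // plain RDD becomes a pair RDD that partitionBy can accept.
            JavaPairRDD<String, Void> paired =
                    rdd.mapToPair(s -> new Tuple2<>(s, (Void) null));
            JavaPairRDD<String, Void> partitioned =
                    paired.partitionBy(new LengthPartitioner(3));

            // Drop the dummy values to recover a plain RDD whose
            // records sit in the partitions the Partitioner chose.
            JavaRDD<String> result = partitioned.keys();
            return result.getNumPartitions();
        } finally {
            sc.close();
        }
    }

    public static void main(String[] args) {
        System.out.println(partitionedCount()); // prints 3
    }
}
```

Be aware that mapping back out of the pair RDD (e.g. with `keys()`) keeps the physical placement but discards the partitioner metadata, so a later shuffle will not know the data was already partitioned.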