I have a JavaPairRDD and need to group by the key and then sort it using a value inside the object MyObject.
Lets say MyObject is:
class MyObject {
Integer order;
String name;
}
Sample data:
1, {order:1, name:'Joseph'}
1, {order:2, name:'Tom'}
1, {order:3, name:'Luke'}
2, {order:1, name:'Alfred'}
2, {order:3, name:'Ana'}
2, {order:2, name:'Jessica'}
3, {order:3, name:'Will'}
3, {order:2, name:'Mariah'}
3, {order:1, name:'Monika'}
Expected result:
Partition 1:
1, {order:1, name:'Joseph'}
1, {order:2, name:'Tom'}
1, {order:3, name:'Luke'}
Partition 2
2, {order:1, name:'Alfred'}
2, {order:2, name:'Jessica'}
2, {order:3, name:'Ana'}
Partition 3:
3, {order:1, name:'Monika'}
3, {order:2, name:'Mariah'}
3, {order:3, name:'Will'}
I'm using the key to partition the RDD and then using MyObject.order to sort the data inside the partition.
My goal is to get only the k-first elements in each sorted partition and then reduce them to a value calculated by other MyObject attribute (AKA "the first N best of the group").
How can I do this?
You can use mapPartitions
:
JavaPairRDD<Long, MyObject> sortedRDD = rdd.groupBy(/* the first number */)
.mapPartitionsToPair(x -> {
List<Tuple2<Long, MyObject>> values = toArrayList(x);
Collections.sort(values, (x, y) -> x._2.order - y._2.order);
return values.iterator();
}, true);
Two highlights: