Search code examples
javaapache-sparkjava-pair-rdd

Spark grouping and then sorting (Java code)


I have a JavaPairRDD and need to group by the key and then sort it using a value inside the object MyObject.

Lets say MyObject is:

class MyObject {
    Integer order;
    String name;
}

Sample data:

1, {order:1, name:'Joseph'}
1, {order:2, name:'Tom'}
1, {order:3, name:'Luke'}
2, {order:1, name:'Alfred'}
2, {order:3, name:'Ana'}
2, {order:2, name:'Jessica'}
3, {order:3, name:'Will'}
3, {order:2, name:'Mariah'}
3, {order:1, name:'Monika'}

Expected result:

Partition 1:

1, {order:1, name:'Joseph'}
1, {order:2, name:'Tom'}
1, {order:3, name:'Luke'}

Partition 2

2, {order:1, name:'Alfred'}
2, {order:2, name:'Jessica'}
2, {order:3, name:'Ana'}

Partition 3:

3, {order:1, name:'Monika'}
3, {order:2, name:'Mariah'}
3, {order:3, name:'Will'}

I'm using the key to partition the RDD and then using MyObject.order to sort the data inside the partition.

My goal is to get only the k-first elements in each sorted partition and then reduce them to a value calculated by other MyObject attribute (AKA "the first N best of the group").

How can I do this?


Solution

  • You can use mapPartitions:

    JavaPairRDD<Long, MyObject> sortedRDD = rdd.groupBy(/* the first number */)
        .mapPartitionsToPair(x -> {
            List<Tuple2<Long, MyObject>> values = toArrayList(x);
            Collections.sort(values, (x, y) -> x._2.order - y._2.order);
    
            return values.iterator();
         }, true);
    

    Two highlights:

    • toArrayList takes an Iterator and returns ArrayList. You must implement it by yourself
    • important is to have true as the second argument of mapPartitionsToPair, because it will preserve partitioning