java, apache-spark, rdd

JavaPairRDD convert key-value into key-list


I have a JavaPairRDD containing (Key, Value) pairs, which I want to group by key so that the "second column" becomes a list of all values seen for that key. I am currently using the groupByKey() function, which groups by key correctly but converts my values to an Iterable of Long. That is:

Key1 Iterable<Long>
Key2 Iterable<Long>
...

Is there any way to force this function to use a List of Longs instead of an Iterable object?

Key1 List<Long>
Key2 List<Long>
...

I read something about a function called combineByKey(), but I don't think it fits my use case. I probably need to use reduceByKey, but I can't see how. It should be something like this:

myRDD.reduceByKey((a,b) -> new ArrayList<Long>()) //and add b to a 

In the end, I want to combine the values to obtain an RDD of (Key, List<Long>) pairs. Thank you for your time.


Solution

  • You can try something like this:

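    // Assumes rdd is a JavaPairRDD<String, Long>; needs java.util.ArrayList and java.util.List.
    JavaPairRDD<String, List<Long>> keyValuePairs = rdd.mapValues(v -> {
        // Wrap each value in a mutable list. Arrays.asList returns a fixed-size list,
        // so calling addAll on it would throw UnsupportedOperationException.
        List<Long> list = new ArrayList<>();
        list.add(v);
        return list;
    }).reduceByKey((a, b) -> {
        a.addAll(b); // merge the two partial lists for the same key
        return a;
    });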
    

    First, each value is wrapped in a single-element list of longs. Then reduceByKey merges the lists for each key by calling addAll, so the lists must be mutable (for example, ArrayList).
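
If groupByKey() is already in place, another option is to keep it and copy each Iterable<Long> into a List<Long> afterwards. The following is only a minimal sketch: the class and method names (IterableToList, toList) are hypothetical, and the Spark call in the trailing comment assumes a JavaPairRDD<String, Long> named rdd and is not executed here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IterableToList {
    // Copies any Iterable into a new, mutable ArrayList (hypothetical helper).
    public static <T> List<T> toList(Iterable<T> values) {
        List<T> result = new ArrayList<>();
        for (T value : values) {
            result.add(value);
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the Iterable<Long> that groupByKey() produces for one key.
        Iterable<Long> grouped = Arrays.asList(1L, 2L, 3L);
        List<Long> asList = toList(grouped);
        System.out.println(asList); // prints [1, 2, 3]

        // Assumed usage with Spark (rdd is a JavaPairRDD<String, Long>):
        // JavaPairRDD<String, List<Long>> result =
        //     rdd.groupByKey().mapValues(IterableToList::toList);
    }
}
```

Both approaches collect every value for a key on a single executor, so neither one avoids the memory cost of holding a large group in one list.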