I have a JavaPairRDD containing (Key, Value) pairs that I want to group by key, so that the "second column" becomes a list of all values seen for that key. I am currently using the groupby() function, which groups the keys correctly but converts my values to an Iterable of Long. That is,
Key1 Iterable<Long>
Key2 Iterable<Long>
...
Is there any way to force this function to use a List of Longs instead of an Iterable object?
Key1 List<Long>
Key2 List<Long>
...
I read something about a function called combineByKey(), but I don't think it fits this use case. I probably need to use reduceByKey, but I can't see how. It should be something like this:
myRDD.reduceByKey((a,b) -> new ArrayList<Long>()) //and add b to a
In the end, I want to combine the values to obtain an RDD of (Key n, List<Long>) pairs.
Thank you for your time.
You can try something like this:
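import java.util.ArrayList;
import java.util.List;
// Assumes rdd is a JavaPairRDD<String, Long>
JavaPairRDD<String, List<Long>> keyValuePairs = rdd.mapValues(v -> {
    // Wrap each value in a mutable list; Arrays.asList is fixed-size, so addAll would fail on it
    List<Long> list = new ArrayList<>();
    list.add(v);
    return list;
}).reduceByKey((a, b) -> {
    a.addAll(b);
    return a;
});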
First, map each value into a single-element list of Longs. Then reduceByKey combines the lists for each key using the addAll method of ArrayList.
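If it helps to see the two steps in isolation, here is a minimal plain-Java sketch of the same idea with no Spark involved: wrap each value in a list, then merge the lists that share a key using addAll. The class name GroupValuesSketch and the method groupValues are hypothetical names made up for this illustration, and the code only imitates the per-key merge locally; it says nothing about how Spark actually executes reduceByKey.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Local, Spark-free illustration of "wrap each value in a list, then merge lists per key".
// GroupValuesSketch and groupValues are hypothetical names used only for this sketch.
class GroupValuesSketch {

    static Map<String, List<Long>> groupValues(List<Map.Entry<String, Long>> pairs) {
        Map<String, List<Long>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            // "map" step: wrap the single value in a mutable list
            List<Long> single = new ArrayList<>();
            single.add(pair.getValue());
            // "reduce" step: the same merge function as the one passed to reduceByKey
            grouped.merge(pair.getKey(), single, (a, b) -> {
                a.addAll(b);
                return a;
            });
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> pairs = List.of(
                Map.entry("Key1", 1L),
                Map.entry("Key2", 5L),
                Map.entry("Key1", 2L));
        System.out.println(groupValues(pairs)); // prints {Key1=[1, 2], Key2=[5]}
    }
}
```

The list being merged into has to be mutable, which is why the sketch uses ArrayList: a list returned by Arrays.asList is fixed-size and throws UnsupportedOperationException on addAll. Also keep in mind that in Spark the order of the values inside each list is not guaranteed, unlike in this sequential sketch.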