I have a large class named "DataClass" including the following members: "time", "value", "type", "name", "family". These are distributed as:
JavaPairRDD<key, DataClass> distributedRDD;
Currently what I do is group all these together in the following way:
JavaPairRDD<key, Iterable<DataClass>> groupedRDD = distributedRDD.groupByKey();
I currently need only two members of this large "DataClass", namely "time" and "value". To improve performance, I want to avoid shuffling this large data type and instead perform the shuffle only on the members I need.
One idea that came to mind is to use reduceByKey to reduce the values from "DataClass" to a "SmallDataClass" (containing only the desired members) and shuffle the smaller class.
Can anyone please help in performing this task?
The easiest way would be to transform the initial RDD into the desired form before applying the group operation:
val timeValueRdd = rdd.map{case (k,v) => (k,(v.time, v.value))}
val grouped = timeValueRdd.groupByKey
There's a slightly more complex option using aggregateByKey, which can be more efficient:
val grouped = rdd.aggregateByKey(List.empty[(String, String)])({ case (list, elem) => (elem.time, elem.value) :: list }, (list1, list2) => list1 ++ list2)
aggregateByKey works like a fold on the map side and then uses a reduce function (as in reduceByKey) to combine the results of each partition into one.
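Since the question is in Java, here is a minimal plain-Java sketch of that two-step behaviour, without Spark, so it only illustrates the semantics and is not Spark's implementation. Partitions are modelled as plain lists; the names AggregateByKeySketch, TimeValue, Pair, foldPartition and combine are hypothetical, and I'm assuming "time" and "value" are Strings as in the Scala snippet above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;
import java.util.function.Supplier;

class AggregateByKeySketch {

    // Hypothetical stand-in for the large DataClass from the question.
    static final class DataClass {
        final String time, value, type, name, family;
        DataClass(String time, String value, String type, String name, String family) {
            this.time = time; this.value = value; this.type = type;
            this.name = name; this.family = family;
        }
    }

    // Hypothetical small projection: only the two members that are needed.
    static final class TimeValue {
        final String time, value;
        TimeValue(String time, String value) { this.time = time; this.value = value; }
        @Override public String toString() { return time + "=" + value; }
    }

    // One keyed record, standing in for an element of the pair RDD.
    static final class Pair {
        final String key; final DataClass data;
        Pair(String key, DataClass data) { this.key = key; this.data = data; }
    }

    // Map side: fold one partition's records into a per-key accumulator,
    // starting each key from a fresh zero value.
    static <U> Map<String, U> foldPartition(List<Pair> partition, Supplier<U> zero,
                                            BiFunction<U, DataClass, U> seqOp) {
        Map<String, U> acc = new HashMap<>();
        for (Pair p : partition) {
            U current = acc.containsKey(p.key) ? acc.get(p.key) : zero.get();
            acc.put(p.key, seqOp.apply(current, p.data));
        }
        return acc;
    }

    // Reduce side: merge the per-partition accumulators key by key.
    static <U> Map<String, U> combine(List<Map<String, U>> partials, BinaryOperator<U> combOp) {
        Map<String, U> result = new HashMap<>();
        for (Map<String, U> partial : partials) {
            for (Map.Entry<String, U> e : partial.entrySet()) {
                result.merge(e.getKey(), e.getValue(), combOp);
            }
        }
        return result;
    }

    // Same shape as the Scala call: empty list as zero, append the small
    // projection per element, concatenate lists across partitions.
    static Map<String, List<TimeValue>> aggregate(List<List<Pair>> partitions) {
        List<Map<String, List<TimeValue>>> partials = new ArrayList<>();
        for (List<Pair> partition : partitions) {
            Map<String, List<TimeValue>> partial = foldPartition(
                partition,
                ArrayList::new,
                (list, d) -> { list.add(new TimeValue(d.time, d.value)); return list; });
            partials.add(partial);
        }
        return combine(partials, (a, b) -> {
            List<TimeValue> merged = new ArrayList<>(a);
            merged.addAll(b);
            return merged;
        });
    }

    public static void main(String[] args) {
        List<List<Pair>> partitions = Arrays.asList(
            Arrays.asList(
                new Pair("a", new DataClass("t1", "v1", "x", "n", "f")),
                new Pair("b", new DataClass("t2", "v2", "x", "n", "f"))),
            Arrays.asList(
                new Pair("a", new DataClass("t3", "v3", "x", "n", "f"))));
        Map<String, List<TimeValue>> grouped = aggregate(partitions);
        System.out.println(grouped.get("a")); // prints [t1=v1, t3=v3]
        System.out.println(grouped.get("b")); // prints [t2=v2]
    }
}
```

Only the small TimeValue objects ever reach the combine step here, which is the point of projecting before the shuffle. In a real Spark job the order of values within each list is not guaranteed, so sort by time afterwards if order matters.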