apache-spark, distributed-computing

Use Spark's reduceByKey to convert value class


I have a large class named "DataClass" with the following members: "time", "value", "type", "name", "family". Instances are distributed as:

JavaPairRDD<key, DataClass> distributedRDD;

Currently what I do is group all these together in the following way:

JavaPairRDD<key, Iterable<DataClass>> groupedRDD = distributedRDD.groupByKey();

I currently only need two members of this large "DataClass", namely "time" and "value". To improve performance, I wanted to avoid shuffling this large data type and instead perform the shuffle only on the desired members.

One idea that came to mind is to somehow use reduceByKey to reduce the values from "DataClass" to a "SmallDataClass" (containing only the desired members) and shuffle on the smaller class.

Can anyone please help in performing this task?


Solution

  • The easiest way would be to transform the initial RDD into the desired form before applying the group operation:

    // keep only the two fields that are actually needed before the shuffle
    val timeValueRdd = rdd.map { case (k, v) => (k, (v.time, v.value)) }
    val grouped = timeValueRdd.groupByKey()
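
    In the Java API from the question, the same idea might look roughly like the sketch below. The key type (String), the getTime()/getValue() accessors and the field types are assumptions, since the question only names the members:

    import org.apache.spark.api.java.JavaPairRDD;
    import scala.Tuple2;

    // Hypothetical sketch: project each DataClass down to (time, value) before grouping,
    // so only the two needed fields are shuffled.
    JavaPairRDD<String, Tuple2<Long, Double>> timeValueRDD =
        distributedRDD.mapToPair(pair ->
            new Tuple2<>(pair._1(), new Tuple2<>(pair._2().getTime(), pair._2().getValue())));

    JavaPairRDD<String, Iterable<Tuple2<Long, Double>>> grouped = timeValueRDD.groupByKey();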
    

    There's a slightly more complex option using aggregateByKey, which will be more efficient because it combines values on the map side before the shuffle:

    val grouped = rdd.aggregateByKey(List.empty[(String, String)])(
      { case (list, elem) => (elem.time, elem.value) :: list },  // add one record's (time, value) to the per-partition list
      (list1, list2) => list1 ++ list2                           // merge the lists built on different partitions
    )
    

    aggregateByKey works like a fold on the map side and then uses a reduce function (like reduceByKey) to combine the results of each partition into one.
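
    For completeness, a rough Java-API equivalent of the aggregateByKey version might look like the following (again, the key type and the getTime()/getValue() accessors are assumptions). Spark allows both functions to modify and return their first argument, so the list can be reused to avoid extra allocations:

    // also needs java.util.ArrayList and java.util.List in addition to the imports above
    JavaPairRDD<String, List<Tuple2<Long, Double>>> grouped =
        distributedRDD.aggregateByKey(
            new ArrayList<Tuple2<Long, Double>>(),  // zero value: an empty list per key
            (list, elem) -> {                       // map side: add one record's (time, value) to the partition-local list
                list.add(new Tuple2<>(elem.getTime(), elem.getValue()));
                return list;
            },
            (list1, list2) -> {                     // reduce side: merge the lists built on different partitions
                list1.addAll(list2);
                return list1;
            });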