apache-spark, hadoop2

Combining two JavaPairRDDs for a subsequent reduceByKey job


I am trying to combine two JavaPairRDDs so that I can run reduceByKey on the combined dataset, like below:


JavaPairRDD data1 = ...

JavaPairRDD data2 = ...

I want to have a new dataset which contains both data1 and data2, something like:

JavaPairRDD data_total = (data1 + data2)

So that I can do a reduce by key on the combined dataset:

JavaPairRDD output = data_total.reduceByKey(... my reduce function ...);


What's the best way to combine data1 and data2? Or what's the best approach to this problem?

Thanks a lot!


Solution

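  • You can use union, which returns a new RDD containing the elements of both inputs (duplicates are kept). Both RDDs must have the same key and value types. You can then call reduceByKey on the result:

    // Return the union of this RDD and another one.
    JavaPairRDD<K,V> union(JavaPairRDD<K,V> other)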
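
In Spark terms, the chain from the question would look roughly like `data1.union(data2).reduceByKey(... my reduce function ...)`. To show what that chain computes without needing a Spark cluster, here is a small plain-Java sketch that models the two steps on in-memory lists. It is not Spark code: the class name `UnionThenReduce`, the helper methods, the sample data, and the choice of summing as the reduce function are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// Plain-Java model of union followed by reduceByKey (hypothetical helper, not part of Spark).
class UnionThenReduce {

    // Models union: concatenates both datasets and keeps duplicate pairs.
    static <K, V> List<Map.Entry<K, V>> union(List<Map.Entry<K, V>> a,
                                              List<Map.Entry<K, V>> b) {
        List<Map.Entry<K, V>> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }

    // Models reduceByKey: merges all values that share a key using fn.
    static <K, V> Map<K, V> reduceByKey(List<Map.Entry<K, V>> pairs,
                                        BinaryOperator<V> fn) {
        Map<K, V> out = new LinkedHashMap<>();
        for (Map.Entry<K, V> p : pairs) {
            out.merge(p.getKey(), p.getValue(), fn);
        }
        return out;
    }

    // Made-up sample data; summing stands in for "my reduce function".
    static Map<String, Integer> demo() {
        List<Map.Entry<String, Integer>> data1 =
                List.of(Map.entry("a", 1), Map.entry("b", 2));
        List<Map.Entry<String, Integer>> data2 =
                List.of(Map.entry("a", 10), Map.entry("c", 3));

        List<Map.Entry<String, Integer>> dataTotal = union(data1, data2);
        return reduceByKey(dataTotal, Integer::sum);
    }
}
```

Calling `UnionThenReduce.demo()` merges the two entries for key "a" (one from each input) into a single value, while "b" and "c" pass through unchanged. The point is that union only concatenates, and the per-key merging happens in the reduce step.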