Perform a join in Spark on only one coordinate of a pair key?


I have 3 RDDs:

  • 1st one is of form ((a,b),c).
  • 2nd one is of form (b,d).
  • 3rd one is of form (a,e).

How can I join these RDDs in Scala so that the final output has the form ((a,b),c,d,e)?


Solution

  • You can re-key the first RDD before each join so that the join key is a single coordinate, then restore the pair key at the end:

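    import org.apache.spark.rdd.RDD

    // A, B, C, D, E stand for your concrete types; replace ??? with your RDDs.
    val rdd1: RDD[((A, B), C)] = ???
    val rdd2: RDD[(B, D)] = ???
    val rdd3: RDD[(A, E)] = ???

    // Re-key rdd1 by a so it can be joined with rdd3.
    val tmp1 = rdd1.map { case ((a, b), c) => (a, (b, c)) }
    // Join on a, then re-key by b for the join with rdd2.
    val tmp2 = tmp1.join(rdd3).map { case (a, ((b, c), e)) => (b, (a, c, e)) }
    // Join on b, then restore the original (a, b) pair key.
    val res = tmp2.join(rdd2).map { case (b, ((a, c, e), d)) => ((a, b), c, d, e) }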
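
To make the re-keying concrete, here is a minimal runnable sketch of the same three steps on made-up data. The object name `PairKeyJoinExample`, the helper `joinExample`, the local master setting, and the sample values are all assumptions for illustration; only `map` and `join` come from the answer itself.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PairKeyJoinExample {

  // Hypothetical helper: builds sample RDDs and runs the two joins.
  def joinExample(sc: SparkContext): Array[((Int, String), Double, String, String)] = {
    // Made-up sample data: a is an Int, b is a String.
    // The key (3, "z") has no match in rdd2 or rdd3.
    val rdd1 = sc.parallelize(Seq(((1, "x"), 10.0), ((2, "y"), 20.0), ((3, "z"), 30.0)))
    val rdd2 = sc.parallelize(Seq(("x", "d-x"), ("y", "d-y")))
    val rdd3 = sc.parallelize(Seq((1, "e-1"), (2, "e-2")))

    rdd1
      .map { case ((a, b), c) => (a, (b, c)) }               // key by a
      .join(rdd3)                                            // attach e
      .map { case (a, ((b, c), e)) => (b, (a, c, e)) }       // key by b
      .join(rdd2)                                            // attach d
      .map { case (b, ((a, c, e), d)) => ((a, b), c, d, e) } // restore (a, b)
      .collect()
  }

  def main(args: Array[String]): Unit = {
    // Local SparkContext for illustration only.
    val conf = new SparkConf().setAppName("pair-key-join").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try joinExample(sc).foreach(println)
    finally sc.stop()
  }
}
```

The result contains `((1,x),10.0,d-x,e-1)` and `((2,y),20.0,d-y,e-2)`, in no guaranteed order. The `(3, "z")` row is dropped because `join` is an inner join; if unmatched rows must be kept, `leftOuterJoin` returns the right-hand value as an `Option` instead.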