Search code examples
scalajoinrdd

Scala--How to get the same part of the two RDDS?


There are two RDD:

val rdd1 = sc.parallelize(List(("aaa", 1), ("bbb", 4), ("ccc", 3)))
val rdd2 = sc.parallelize(List(("aaa", 2),  ("bbb", 5), ("ddd", 2))) 

If I want to join those by the first field and get the result like:

List(("aaa", 1,2), ("bbb",4 ,5))

What should I code?Thx!!!!


Solution

  • You can join the RDDs and map the result to the wanted data structure:

    val resultRDD = rdd1.join(rdd2).map{
      case (k: String, (v1: Int, v2: Int)) => (k, v1, v2)
    }
    // resultRDD: org.apache.spark.rdd.RDD[(String, Int, Int)] = MapPartitionsRDD[53] at map at <console>:32
    
    resultRDD.collect
    // res1: Array[(String, Int, Int)] = Array((aaa,1,2), (bbb,4,5))