There are two RDD:
val rdd1 = sc.parallelize(List(("aaa", 1), ("bbb", 4), ("ccc", 3)))
val rdd2 = sc.parallelize(List(("aaa", 2), ("bbb", 5), ("ddd", 2)))
If I want to join those by the first field and get the result like:
List(("aaa", 1,2), ("bbb",4 ,5))
What should I code?Thx!!!!
You can join
the RDDs and map
the result to the wanted data structure:
val resultRDD = rdd1.join(rdd2).map{
case (k: String, (v1: Int, v2: Int)) => (k, v1, v2)
}
// resultRDD: org.apache.spark.rdd.RDD[(String, Int, Int)] = MapPartitionsRDD[53] at map at <console>:32
resultRDD.collect
// res1: Array[(String, Int, Int)] = Array((aaa,1,2), (bbb,4,5))