Tags: scala, apache-spark, apache-spark-sql, distributed-computing, rdd

Concatenating datasets of different RDDs in Apache Spark using Scala


Is there a way to concatenate the datasets of two different RDDs in Spark?

The requirement is: I create two intermediate RDDs in Scala that have the same column names. I need to combine the results of both RDDs and cache the combined result so it can be accessed from a UI. How do I combine the datasets here?

The RDDs are of type spark.sql.SchemaRDD.


Solution

  • I think you are looking for RDD.union:

    val rddPart1 = ??? // your first intermediate RDD
    val rddPart2 = ??? // your second intermediate RDD, with the same columns
    val rddAll = rddPart1.union(rddPart2) // lazy concatenation, no shuffle
    

    Example (in the Spark shell):

    val rdd1 = sc.parallelize(Seq((1, "Aug", 30),(1, "Sep", 31),(2, "Aug", 15),(2, "Sep", 10)))
    val rdd2 = sc.parallelize(Seq((1, "Oct", 10),(1, "Nov", 12),(2, "Oct", 5),(2, "Nov", 15)))
    rdd1.union(rdd2).collect
    
    res0: Array[(Int, String, Int)] = Array((1,Aug,30), (1,Sep,31), (2,Aug,15), (2,Sep,10), (1,Oct,10), (1,Nov,12), (2,Oct,5), (2,Nov,15))
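
    Since the question mentions spark.sql.SchemaRDD: RDD.union concatenates the rows, but the result comes back typed as a plain RDD[Row], so the schema is lost. On Spark 1.x, SchemaRDD has its own unionAll method that returns another SchemaRDD and keeps the schema, which also fits the caching requirement. A minimal sketch, assuming rddPart1 and rddPart2 are placeholders for your two intermediate SchemaRDDs with identical columns:

    import org.apache.spark.sql.SchemaRDD

    // Placeholder intermediate results; both must share the same schema.
    val rddPart1: SchemaRDD = ???
    val rddPart2: SchemaRDD = ???

    // unionAll preserves the SchemaRDD type (and thus the column names),
    // unlike RDD.union, whose result is typed as RDD[Row].
    val rddAll: SchemaRDD = rddPart1.unionAll(rddPart2)

    // Cache the combined result so repeated reads from the UI are cheap.
    rddAll.cache()

    Note that neither union nor unionAll deduplicates rows; if you need distinct rows, follow up with a distinct, keeping in mind that it triggers a shuffle.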