Search code examples
pythonapache-sparkpysparkapache-spark-sqlunion

How to perform union on two DataFrames with different amounts of columns in Spark?


I have 2 DataFrames:

Source data

I need union like this:

enter image description here

The unionAll function doesn't work because the number and the name of columns are different.

How can I do this?


Solution

  • In Scala you just have to append all missing columns as nulls.

    import org.apache.spark.sql.functions._
    
    // let df1 and df2 the Dataframes to merge
    val df1 = sc.parallelize(List(
      (50, 2),
      (34, 4)
    )).toDF("age", "children")
    
    val df2 = sc.parallelize(List(
      (26, true, 60000.00),
      (32, false, 35000.00)
    )).toDF("age", "education", "income")
    
    val cols1 = df1.columns.toSet
    val cols2 = df2.columns.toSet
    val total = cols1 ++ cols2 // union
    
    def expr(myCols: Set[String], allCols: Set[String]) = {
      allCols.toList.map(x => x match {
        case x if myCols.contains(x) => col(x)
        case _ => lit(null).as(x)
      })
    }
    
    df1.select(expr(cols1, total):_*).unionAll(df2.select(expr(cols2, total):_*)).show()
    
    +---+--------+---------+-------+
    |age|children|education| income|
    +---+--------+---------+-------+
    | 50|       2|     null|   null|
    | 34|       4|     null|   null|
    | 26|    null|     true|60000.0|
    | 32|    null|    false|35000.0|
    +---+--------+---------+-------+
    

    Update

    Both temporal DataFrames will have the same order of columns, because we are mapping through total in both cases.

    df1.select(expr(cols1, total):_*).show()
    df2.select(expr(cols2, total):_*).show()
    
    +---+--------+---------+------+
    |age|children|education|income|
    +---+--------+---------+------+
    | 50|       2|     null|  null|
    | 34|       4|     null|  null|
    +---+--------+---------+------+
    
    +---+--------+---------+-------+
    |age|children|education| income|
    +---+--------+---------+-------+
    | 26|    null|     true|60000.0|
    | 32|    null|    false|35000.0|
    +---+--------+---------+-------+