
Join two dataframes in scala spark using full join


I have two dataframes:

Dataframe1:
 "a","x"
  1 , 2
  3 , 4
  2 , 5

Dataframe2:
 "a","y"
  1 , 3
  3 , 5
  7 , 6

I want to create a new DataFrame that looks like this:

"a", "x" , "y"
  1    2     3
  3    4     5
  2    5     null
  7    null  6

I tried using joins, but the result contains two copies of the join key column. How should I proceed?


Solution

  • You can use `join` and pass the name of the column you want to join on (the "a" column) as `usingColumns`; Spark will then emit the join column only once in the output instead of keeping both copies.

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.SparkSession
    
    object test extends App {
      val spark = SparkSession.builder().master("local[*]").getOrCreate()
      import spark.implicits._
      val df1 = Seq((1, 2), (3, 4), (2, 5)).toDF("a", "x")
      val df2 = Seq((1, 3), (3, 5), (7, 6)).toDF("a", "y")
    
      val resultDF = df1.join(df2, Seq("a"), "outer")
      resultDF.show()
    //  +---+----+----+
    //  |  a|   x|   y|
    //  +---+----+----+
    //  |  1|   2|   3|
    //  |  3|   4|   5|
    //  |  7|null|   6|
    //  |  2|   5|null|
    //  +---+----+----+
    

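    If you prefer SQL, the same result can be obtained with a `FULL OUTER JOIN ... USING` query. This is a minimal sketch assuming the same `spark` session and dataframes as above; the view names `t1` and `t2` are arbitrary:

    ```scala
    // Register the dataframes as temporary views so they can be queried with SQL
    df1.createOrReplaceTempView("t1")
    df2.createOrReplaceTempView("t2")

    // USING (a) collapses the key into a single column, just like Seq("a") above
    spark.sql("SELECT * FROM t1 FULL OUTER JOIN t2 USING (a)").show()
    ```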
    This is the signature of the `join` method:

    def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): sql.DataFrame
    

    Equi-join with another DataFrame using the given columns.

    Different from other join functions, the join columns will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.

    Params:

    right – Right side of the join operation.

    usingColumns – Names of the columns to join on. These columns must exist on both sides.

    joinType – Type of join to perform.
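    For contrast, the duplication described in the question happens when you join with a column expression instead of `usingColumns`. A sketch, using the same `df1` and `df2` as above:

    ```scala
    // Joining on an expression keeps the key column from BOTH sides
    val dup = df1.join(df2, df1("a") === df2("a"), "outer")
    // dup's columns are: a, x, a, y  -- "a" appears twice
    ```

    Dropping one copy afterwards (e.g. `dup.drop(df2("a"))`) is not a safe fix for a full outer join: rows that exist only on the right side (such as a = 7 here) have a null key on the left, so you would lose the key value. The `Seq("a")` form in the solution avoids this by coalescing both key columns into one.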