
Join two dataframes in scala spark using full join


I have two dataframes:

Dataframe1:
 "a","x"
  1 , 2
  3 , 4
  2 , 5

Dataframe2:
 "a","y"
  1 , 3
  3 , 5
  7 , 6

I want to create a new DataFrame that looks like this:

"a", "x" , "y"
  1    2     3
  3    4     5
  2    5     null
  7    null  6

I tried using joins, but the result contains two copies of the join key column. How should I proceed?


Solution

  • You can use `join` and pass the name of the column you want to join on (the "a" column) as `usingColumns`; Spark will then emit the join column only once in the output instead of keeping both copies.

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.SparkSession
    
    object test extends App {
      val spark = SparkSession.builder().master("local[*]").getOrCreate()
      import spark.implicits._
      val df1 = Seq((1, 2), (3, 4), (2, 5)).toDF("a", "x")
      val df2 = Seq((1, 3), (3, 5), (7, 6)).toDF("a", "y")
    
      val resultDF = df1.join(df2, Seq("a"), "outer")
      resultDF.show()
    //  +---+----+----+
    //  |  a|   x|   y|
    //  +---+----+----+
    //  |  1|   2|   3|
    //  |  3|   4|   5|
    //  |  7|null|   6|
    //  |  2|   5|null|
    //  +---+----+----+
    

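    If you prefer SQL, the same result can be obtained with a `FULL OUTER JOIN ... USING` query. This is a minimal sketch assuming the same `spark` session and dataframes as above; the view names `t1` and `t2` are arbitrary:

    ```scala
    // Register the dataframes as temporary views so they can be queried with SQL
    df1.createOrReplaceTempView("t1")
    df2.createOrReplaceTempView("t2")

    // USING (a) collapses the key into a single column, just like Seq("a") above
    spark.sql("SELECT * FROM t1 FULL OUTER JOIN t2 USING (a)").show()
    ```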
    This is the signature of the `join` method:

    def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): sql.DataFrame
    

    Equi-join with another DataFrame using the given columns.

    Different from other join functions, the join columns will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.

    Params:

    right – Right side of the join operation.

    usingColumns – Names of the columns to join on. These columns must exist on both sides.

    joinType – Type of join to perform.
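    For contrast, the duplication described in the question happens when you join with a column expression instead of `usingColumns`. A sketch, using the same `df1` and `df2` as above:

    ```scala
    // Joining on an expression keeps the key column from BOTH sides
    val dup = df1.join(df2, df1("a") === df2("a"), "outer")
    // dup's columns are: a, x, a, y  -- "a" appears twice
    ```

    Dropping one copy afterwards (e.g. `dup.drop(df2("a"))`) is not a safe fix for a full outer join: rows that exist only on the right side (such as a = 7 here) have a null key on the left, so you would lose the key value. The `Seq("a")` form in the solution avoids this by coalescing both key columns into one.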