I have two dataframes
Dataframe1:
"a","x"
1 , 2
3 , 4
2 , 5
Dataframe2:
"a","y"
1 , 3
3 , 5
7 , 6
I want to create a new DataFrame that looks like this:
"a", "x",  "y"
 1    2     3
 3    4     5
 2    5    null
 7   null   6
I tried using joins, but the result contains two copies of the join key column. How should I proceed?
You could use join and pass the join column as a Seq of names (here, the "a" column); when you join this way, Spark outputs the key column only once instead of duplicating it. Using "outer" as the join type keeps keys that appear on only one side, filling the missing values with null.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
object test extends App {
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._
val df1 = Seq((1, 2), (3, 4), (2, 5)).toDF("a", "x")
val df2 = Seq((1, 3), (3, 5), (7, 6)).toDF("a", "y")
val resultDF = df1.join(df2, Seq("a"), "outer")
resultDF.show()
// +---+----+----+
// | a| x| y|
// +---+----+----+
// | 1| 2| 3|
// | 3| 4| 5|
// | 7|null| 6|
// | 2| 5|null|
// +---+----+----+
}
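For contrast, here is a sketch of the join the question ran into (assuming the same df1 and df2 as above): joining on a column expression keeps both sides' key columns, so you end up with two columns named "a" and have to drop one yourself. With the Seq("a") form, a full outer join also merges the key from both sides, which is why 7 appears with its key in the result above.

```scala
// Joining on an expression instead of Seq("a") keeps both key columns:
val dup = df1.join(df2, df1("a") === df2("a"), "outer")
dup.printSchema() // schema now contains two columns named "a"

// You could drop the right-hand key afterwards, but on an outer join
// that discards keys present only in df2 (e.g. 7), so the Seq("a")
// form is the cleaner fix:
val deduped = dup.drop(df2("a"))
```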
This is the signature of the join method used above:
def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): sql.DataFrame
Equi-join with another DataFrame using the given columns.
Different from other join functions, the join columns will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.
Params:
right – Right side of the join operation.
usingColumns – Names of the columns to join on. These columns must exist on both sides.
joinType – Type of join to perform.
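Since the doc compares this to SQL's JOIN USING syntax, the same result can be sketched in Spark SQL (assuming the df1 and df2 from the answer; the temp view names t1 and t2 are just illustrative):

```scala
df1.createOrReplaceTempView("t1")
df2.createOrReplaceTempView("t2")

// USING (a) emits the key column once, matching join(df2, Seq("a"), "outer")
val sqlResult = spark.sql("SELECT * FROM t1 FULL OUTER JOIN t2 USING (a)")
sqlResult.show()
```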