I have 2 dataframes in the databricks that I would like to perform a join on. The dataframes are as so:
df1
name count_1
a 3
b 4
c 3
df2
name count_2
a 3
b 4
d 8
I would like to join them and retain all the information (even if name d
is absent from df1). So the desired output would be as so:
name count_1 count_2
a 3 3
b 4 4
c 3 null
d null 8
So I have myself the following query:
joined_df = df1.join(df2, df1['name'] == df2['name'], "outer")
It performs the join I want, but it creates a new column called name
to accommodate the row with name d
. So currently I get this:
name count_1 name count_2
a 3 a 3
b 4 b 4
c 3 null null
null null d 8
So the question is: how can I perform a join to achieve my desired result instead of my current result?
Turns out I am just being silly. If I just used on
arg of join as intended, I would not have this problem. The solution is as so:
joined_df = df1.join(df2, 'name', "outer")