I have a requirement that if a join column in df2 has a null value, the row should still match. By default, Spark does not match rows on null values.
df1:
ID Name City EMAIL
1 John City A [email protected]
2 Mist City B [email protected]
3 Danny City C [email protected]
df2:
ID Name City EMAIL
1 John City A [email protected]
2 null City B [email protected]
3 Danny City C [email protected]
df3 = df1.join(df2, on=["ID", "NAME", "CITY"])
display(df3)
Spark Output:
ID Name City EMAIL EMAIL
1 John City A [email protected] [email protected]
3 Danny City C [email protected] [email protected]
Expected Output:
ID Name City EMAIL EMAIL
1 John City A [email protected] [email protected]
2 Mist City B [email protected] [email protected]
3 Danny City C [email protected] [email protected]
As shown above, for row 2 the ID and CITY values match and NAME is null in df2, so the join should still treat it as a match and produce the expected output. I can't remove NAME from the join columns either: NAME should still be compared where it has a value, but when it is null the rows should also match.
Kindly help.
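For context, the row is dropped because comparing anything to null with = evaluates to NULL (unknown) in Spark SQL, and a join only keeps rows where the condition is true. A tiny illustration, assuming an active SparkSession named spark (not part of the original post):

# Comparing a null with '=' yields NULL, not true or false, so an ordinary
# equi-join silently drops rows whose join key is null.
spark.sql("SELECT NULL = 'Mist' AS cmp").show()  # cmp comes back as null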
Try this:
df3 = df1.join(
    df2,
    on=[
        df1.ID == df2.ID,
        df1.City == df2.City,
        # Match on Name, or accept the row when df2.Name is null.
        (df1.Name == df2.Name) | df2.Name.isNull()
    ]
).select(df1.ID, df1.Name, df1.City, df1.EMAIL, df2.EMAIL)

df3.show()
Output:
+---+-----+------+-----------+-----------+
| ID| Name| City| EMAIL| EMAIL|
+---+-----+------+-----------+-----------+
| 1| John|City A|[email protected]|[email protected]|
| 2| Mist|City B|[email protected]|[email protected]|
| 3|Danny|City C|[email protected]|[email protected]|
+---+-----+------+-----------+-----------+
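If you want to verify this end to end, here is a minimal, self-contained sketch that rebuilds the question's DataFrames (the email values are just the redacted placeholders from the post) and runs the join above. It assumes a PySpark environment where a SparkSession can be created.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rebuild the sample data from the question; df2.Name is null for row 2.
df1 = spark.createDataFrame(
    [(1, "John", "City A", "[email protected]"),
     (2, "Mist", "City B", "[email protected]"),
     (3, "Danny", "City C", "[email protected]")],
    ["ID", "Name", "City", "EMAIL"],
)
df2 = spark.createDataFrame(
    [(1, "John", "City A", "[email protected]"),
     (2, None, "City B", "[email protected]"),
     (3, "Danny", "City C", "[email protected]")],
    ["ID", "Name", "City", "EMAIL"],
)

df3 = df1.join(
    df2,
    on=[
        df1.ID == df2.ID,
        df1.City == df2.City,
        # Equality for non-null names, or accept the row when df2.Name is null.
        (df1.Name == df2.Name) | df2.Name.isNull(),
    ],
).select(df1.ID, df1.Name, df1.City, df1.EMAIL, df2.EMAIL)

df3.show()

Note that eqNullSafe (the <=> operator) is not a drop-in replacement here: it only treats null as equal to null, whereas the requirement is that a null in df2.Name should match any Name from df1.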