
PySpark join: rows where the second dataframe has null values should still match


I have a requirement that if a join column in df2 contains a null value, the row should still match. By default, Spark does not match rows on null values.

df1:

ID  Name    City    EMAIL
1   John    City A  [email protected]
2   Mist    City B  [email protected]
3   Danny   City C  [email protected]

df2:

ID  Name    City    EMAIL
1   John    City A  [email protected]
2   null    City B  [email protected]
3   Danny   City C  [email protected]

df3 = df1.join(df2, on=["ID", "NAME", "CITY"])
display(df3)

Spark Output:

ID  Name    City    EMAIL   EMAIL
1   John    City A  [email protected] [email protected]
3   Danny   City C  [email protected] [email protected]

Expected Output:

ID  Name    City    EMAIL           EMAIL
1   John    City A  [email protected] [email protected]
2   Mist    City B  [email protected] [email protected]
3   Danny   City C  [email protected] [email protected]

As shown above, since ID and CITY match and NAME is null in df2, the join should still succeed and produce the expected result.

And I can't remove NAME from the join columns: it should still match on NAME, except that when NAME is null the rows should also match.

Kindly help.
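For reference, the matching rule being asked for, "equal, or the df2 value is null", can be sketched in plain Python without Spark (the helper name and row tuples below are mine, not part of any API):

```python
def matches(left, right):
    """True when every df2 value equals the df1 value or is None (null)."""
    return all(r is None or l == r for l, r in zip(left, right))

# (ID, Name, City) tuples mirroring df1 and df2 above.
df1_rows = [(1, "John", "City A"), (2, "Mist", "City B"), (3, "Danny", "City C")]
df2_rows = [(1, "John", "City A"), (2, None, "City B"), (3, "Danny", "City C")]

# Simulate an inner join on (ID, Name, City) with the null-tolerant rule:
# all three rows pair up, including the one where df2's Name is null.
joined = [(l, r) for l in df1_rows for r in df2_rows if matches(l, r)]
```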


Solution

  • Pass a list of column expressions to on; Spark ANDs them together. A plain equality never matches a null (the comparison evaluates to null, which the join treats as false), so add an explicit isNull() branch for Name:

    df3 = df1.join(
        df2,
        on=[
            df1.ID == df2.ID,
            df1.City == df2.City,
            # Equality alone never matches null, so explicitly accept
            # the row when df2.Name is null.
            (df1.Name == df2.Name) | df2.Name.isNull()
        ]
    ).select(df1.ID, df1.Name, df1.City, df1.EMAIL, df2.EMAIL)
    df3.show()
    

    Output:

    +---+-----+------+-----------+-----------+
    | ID| Name|  City|      EMAIL|      EMAIL|
    +---+-----+------+-----------+-----------+
    |  1| John|City A|[email protected]|[email protected]|
    |  2| Mist|City B|[email protected]|[email protected]|
    |  3|Danny|City C|[email protected]|[email protected]|
    +---+-----+------+-----------+-----------+
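A related Spark operator is Column.eqNullSafe (SQL's <=>), but note it has different semantics: it treats null as equal to null only, whereas the condition above lets a null on the df2 side match any value. The contrast can be sketched in plain Python (the helper names are mine):

```python
def or_isnull(a, b):
    """Models (df1.Name == df2.Name) | df2.Name.isNull(): a null on the
    df2 side (b) matches anything; a null on the df1 side (a) matches
    only when b is also null."""
    return b is None or (a is not None and a == b)

def eq_null_safe(a, b):
    """Models df1.Name.eqNullSafe(df2.Name), i.e. SQL's <=>:
    null matches null and nothing else."""
    return a == b
```

With eqNullSafe, the "Mist" row would still be dropped, because "Mist" <=> null is false; only the explicit isNull() branch satisfies the requirement here.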