
PySpark join: rows where the second dataframe has null values should still match


I have a requirement that if a join column in df2 contains a null value, the row should still match. By default, Spark does not match rows on null values.

df1:

ID  Name    City    EMAIL
1   John    City A  [email protected]
2   Mist    City B  [email protected]
3   Danny   City C  [email protected]

df2:

ID  Name    City    EMAIL
1   John    City A  [email protected]
2   null    City B  [email protected]
3   Danny   City C  [email protected]

df3 = df1.join(df2, on=["ID", "NAME", "CITY"])
display(df3)

Spark Output:

ID  Name    City    EMAIL   EMAIL
1   John    City A  [email protected] [email protected]
3   Danny   City C  [email protected] [email protected]

Expected Output:

ID  Name    City    EMAIL           EMAIL
1   John    City A  [email protected] [email protected]
2   Mist    City B  [email protected] [email protected]
3   Danny   City C  [email protected] [email protected]

As shown above, since ID and CITY match and NAME is null in df2, the join should still succeed and produce the expected result.

And I can't remove NAME from the join columns: it should still match on NAME, except that when NAME is null the rows should also match.

Kindly help.
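For reference, the matching rule being asked for, "equal, or the df2 value is null", can be sketched in plain Python without Spark (the helper name and row tuples below are mine, not part of any API):

```python
def matches(left, right):
    """True when every df2 value equals the df1 value or is None (null)."""
    return all(r is None or l == r for l, r in zip(left, right))

# (ID, Name, City) tuples mirroring df1 and df2 above.
df1_rows = [(1, "John", "City A"), (2, "Mist", "City B"), (3, "Danny", "City C")]
df2_rows = [(1, "John", "City A"), (2, None, "City B"), (3, "Danny", "City C")]

# Simulate an inner join on (ID, Name, City) with the null-tolerant rule:
# all three rows pair up, including the one where df2's Name is null.
joined = [(l, r) for l in df1_rows for r in df2_rows if matches(l, r)]
```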


Solution

  • Pass a list of column expressions to on; Spark ANDs them together. A plain equality never matches a null (the comparison evaluates to null, which the join treats as false), so add an explicit isNull() branch for Name:

    df3 = df1.join(
        df2,
        on=[
            df1.ID == df2.ID,
            df1.City == df2.City,
            # Equality alone never matches null, so explicitly accept
            # the row when df2.Name is null.
            (df1.Name == df2.Name) | df2.Name.isNull()
        ]
    ).select(df1.ID, df1.Name, df1.City, df1.EMAIL, df2.EMAIL)
    df3.show()
    

    Output:

    +---+-----+------+-----------+-----------+
    | ID| Name|  City|      EMAIL|      EMAIL|
    +---+-----+------+-----------+-----------+
    |  1| John|City A|[email protected]|[email protected]|
    |  2| Mist|City B|[email protected]|[email protected]|
    |  3|Danny|City C|[email protected]|[email protected]|
    +---+-----+------+-----------+-----------+
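A related Spark operator is Column.eqNullSafe (SQL's <=>), but note it has different semantics: it treats null as equal to null only, whereas the condition above lets a null on the df2 side match any value. The contrast can be sketched in plain Python (the helper names are mine):

```python
def or_isnull(a, b):
    """Models (df1.Name == df2.Name) | df2.Name.isNull(): a null on the
    df2 side (b) matches anything; a null on the df1 side (a) matches
    only when b is also null."""
    return b is None or (a is not None and a == b)

def eq_null_safe(a, b):
    """Models df1.Name.eqNullSafe(df2.Name), i.e. SQL's <=>:
    null matches null and nothing else."""
    return a == b
```

With eqNullSafe, the "Mist" row would still be dropped, because "Mist" <=> null is false; only the explicit isNull() branch satisfies the requirement here.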