
How to join two different datasets with different conditions with different columns?


I want to join these two datasets, with a different join condition on each column, to obtain a single dataset in PySpark.

The first dataset df1:

rc1 rc2 rc3 resp
AB2 AB1 AB6 jean
AB4 AB3 AB7 shein
AB9 AB5 AB8 patrick

The second dataset df2:

Key description
AB1 Normal
AB4 Expand
AB3 small
AB6 Big
AB8 First
AB2 Dock
AB7 Missing
AB9 Package
AB5 Wrong

The final dataset df, obtained by joining df1 and df2, should be:

rc1 rc2 rc3 resp
Dock Normal Big jean
Expand small Missing shein
Package Wrong First patrick

Solution

  • See the implementation below:

    df = (
           # Replace each rc* code with its description from df2:
           # join on the code, drop the matched columns, then rename
           # 'description' to the original column name.
           df1.join(df2, df1.rc1 == df2.Key, 'inner').drop("Key", "rc1")
              .withColumnRenamed('description', 'rc1')
              .join(df2, df1.rc2 == df2.Key, 'inner').drop("Key", "rc2")
              .withColumnRenamed('description', 'rc2')
              .join(df2, df1.rc3 == df2.Key, 'inner').drop("Key", "rc3")
              .withColumnRenamed('description', 'rc3')
              # restore the original column order
              .select("rc1", "rc2", "rc3", "resp")
         )
    
    df.show()
    
    +-------+------+-------+-------+
    |    rc1|   rc2|    rc3|   resp|
    +-------+------+-------+-------+
    |   Dock|Normal|    Big|   jean|
    |Package| Wrong|  First|patrick|
    | Expand| small|Missing|  shein|
    +-------+------+-------+-------+
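Conceptually, each of the three joins is just a Key → description lookup against df2. A minimal plain-Python sketch of the same mapping (no Spark required; the data is copied from the question):

```python
# Lookup table built from df2: Key -> description.
lookup = {
    "AB1": "Normal", "AB4": "Expand", "AB3": "small",
    "AB6": "Big", "AB8": "First", "AB2": "Dock",
    "AB7": "Missing", "AB9": "Package", "AB5": "Wrong",
}

# Rows of df1.
rows = [
    {"rc1": "AB2", "rc2": "AB1", "rc3": "AB6", "resp": "jean"},
    {"rc1": "AB4", "rc2": "AB3", "rc3": "AB7", "resp": "shein"},
    {"rc1": "AB9", "rc2": "AB5", "rc3": "AB8", "resp": "patrick"},
]

# Replace each rc* code with its description, leaving resp unchanged.
result = [
    {col: (lookup.get(val, val) if col != "resp" else val)
     for col, val in row.items()}
    for row in rows
]
# result[0] == {"rc1": "Dock", "rc2": "Normal", "rc3": "Big", "resp": "jean"}
```

Note that with 'inner' joins, any row of df1 whose code has no match in df2 is dropped entirely; using 'left' joins instead would keep such rows with a null description.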