I will join these two datasets with different conditions on different columns to obtain one dataset in Pyspark.
The first dataset df1:
rc1 | rc2 | rc3 | resp |
---|---|---|---|
AB2 | AB1 | AB6 | jean |
AB4 | AB3 | AB7 | shein |
AB9 | AB5 | AB8 | patrick |
The second dataset df2:
Key | description |
---|---|
AB1 | Normal |
AB4 | Expand |
AB3 | small |
AB6 | Big |
AB8 | First |
AB2 | Dock |
AB7 | Missing |
AB9 | Package |
AB5 | Wrong |
I will obtain in final dataset df join with df1 & df2:
rc1 | rc2 | rc3 | resp |
---|---|---|---|
Dock | Normal | Big | jean |
Expand | Small | Missing | shein |
Package | Wrong | First | Patrick |
See the below implementation -
df = (
df1.join(df2, df1.rc1 == df2.Key, 'inner').drop("Key","rc1")
.withColumnRenamed('description', 'rc1')
.join(df2, df1.rc2 == df2.Key, 'inner').drop("Key","rc2")
.withColumnRenamed('description', 'rc2')
.join(df2, df1.rc3 == df2.Key, 'inner').drop("Key","rc3")
.withColumnRenamed('description', 'rc3')
.select("rc1","rc2","rc3","resp")
)
df.show()
+-------+------+-------+-------+
| rc1| rc2| rc3| resp|
+-------+------+-------+-------+
| Dock|Normal| Big| jean|
|Package| Wrong| First|patrick|
| Expand| Small|Missing| shein|
+-------+------+-------+-------+