I have a dataframe df1 like this:

+------------+------+
|       Value|Amount|
+------------+------+
|apple;banana|   150|
|      carrot|    20|
+------------+------+

and another dataframe df2 like this:

+------+
| Value|
+------+
| apple|
|orange|
+------+

How could I join df2 with df1 using a left join so that my output would look like the following?

+------+------+
| Value|Amount|
+------+------+
| apple|   150|
|orange|  null|
+------+------+
You can split the values in df1 and explode them before the join.
df3 = df1.withColumn('Value', F.explode(F.split('Value', ';')))
df4 = df2.join(df3, 'Value', 'left')
Full example:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([('apple;banana', 150), ('carrot', 20)], ['Value', 'Amount'])
df2 = spark.createDataFrame([('apple',), ('orange',)], ['Value'])

# Split the semicolon-delimited strings into arrays, then explode
# so that every element gets its own row.
df3 = df1.withColumn('Value', F.explode(F.split('Value', ';')))

# Left join keeps every row of df2, matched against the exploded df3.
df4 = df2.join(df3, 'Value', 'left')

df4.show()
# +------+------+
# | Value|Amount|
# +------+------+
# | apple|   150|
# |orange|  null|
# +------+------+
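If it helps to see the intermediate step, df3 after the split and explode should look like this (same data as above):

df3.show()
# +------+------+
# | Value|Amount|
# +------+------+
# | apple|   150|
# |banana|   150|
# |carrot|    20|
# +------+------+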
Dealing with nulls. If you have nulls in the column "Value" in both dataframes and you want those rows to join successfully, you will need null-safe equality, i.e. eqNullSafe. There is one more subtlety: split returns null for a null input, and explode silently drops rows whose array is null, so the null array is first coalesced with array(lit(None)) to keep those rows. Using the eqNullSafe condition would normally leave the "Value" columns from both dataframes in the output, so to remove the duplicate explicitly, I suggest using alias on the dataframes.
from pyspark.sql import functions as F

df1 = spark.createDataFrame([('apple;banana', 150), (None, 20)], ['Value', 'Amount'])
df2 = spark.createDataFrame([('apple',), ('orange',), (None,)], ['Value'])

# split(null) returns null and explode(null) would drop the row,
# so replace a null array with array(null) to keep it.
df3 = df1.withColumn('Value', F.explode(F.coalesce(F.split('Value', ';'), F.array(F.lit(None)))))

# Null-safe join; drop the duplicate "Value" column coming from df3.
df4 = df2.alias('a').join(
    df3.alias('b'),
    df2.Value.eqNullSafe(df3.Value),
    'left'
).drop(F.col('b.Value'))
df4.show()
# +------+------+
# | Value|Amount|
# +------+------+
# | apple|   150|
# |  null|    20|
# |orange|  null|
# +------+------+
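For contrast, here is a sketch of the same join with plain equality (df5 is just an illustrative name, and row order in show() may vary): null == null evaluates to null in SQL, so the null row in df2 finds no match and its Amount comes back as null instead of 20.

# Plain equality join on the same data: the null row no longer matches.
df5 = df2.alias('a').join(
    df3.alias('b'),
    df2.Value == df3.Value,
    'left'
).drop(F.col('b.Value'))

df5.show()
# +------+------+
# | Value|Amount|
# +------+------+
# | apple|   150|
# |orange|  null|
# |  null|  null|
# +------+------+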