Tags: apache-spark, join, pyspark, substring, left-join

Join dataframes when the value of one column exists as a substring in another dataframe


I have a dataframe df1 like this:

+------------+------+
|       Value|Amount|
+------------+------+
|apple;banana|   150|
|      carrot|    20|
+------------+------+

and another dataframe df2 like this:

+------+
| Value|
+------+
| apple|
|orange|
+------+

How could I left-join df2 with df1 so that the output looks like the following?

+------+------+
| Value|Amount|
+------+------+
| apple|   150|
|orange|  null|
+------+------+


Solution

  • You can split the values in df1 on the ';' separator and explode them before the join.

    df3 = df1.withColumn('Value', F.explode(F.split('Value', ';')))
    df4 = df2.join(df3, 'Value', 'left')
    

    Full example:

    from pyspark.sql import functions as F
    df1 = spark.createDataFrame([('apple;banana', 150), ('carrot', 20)], ['Value', 'Amount'])
    df2 = spark.createDataFrame([('apple',), ('orange',)], ['Value'])
    
    df3 = df1.withColumn('Value', F.explode(F.split('Value', ';')))
    df4 = df2.join(df3, 'Value', 'left')
    
    df4.show()
    # +------+------+
    # | Value|Amount|
    # +------+------+
    # | apple|   150|
    # |orange|  null|
    # +------+------+
    

    Dealing with nulls. If both dataframes contain nulls in the "Value" column that you want to match, use eqNullSafe equality instead of a plain equality condition. A join condition like this normally keeps the "Value" columns from both dataframes in the output, so alias the dataframes and drop the duplicate column explicitly.

    from pyspark.sql import functions as F
    df1 = spark.createDataFrame([('apple;banana', 150), (None, 20)], ['Value', 'Amount'])
    df2 = spark.createDataFrame([('apple',), ('orange',), (None,)], ['Value'])
    
    # split(null) returns null and explode(null) drops the row, so wrap the
    # split in coalesce(..., array(null)) to keep df1 rows with a null Value
    df3 = df1.withColumn('Value', F.explode(F.coalesce(F.split('Value', ';'), F.array(F.lit(None)))))
    df4 = df2.alias('a').join(
        df3.alias('b'),
        df2.Value.eqNullSafe(df3.Value),
        'left'
    ).drop(F.col('b.Value'))
    
    df4.show()
    # +------+------+
    # | Value|Amount|
    # +------+------+
    # | apple|   150|
    # |  null|    20|
    # |orange|  null|
    # +------+------+