Tags: python, apache-spark, pyspark, sql-like

Pyspark: Filter data frame if column contains string from another column (SQL LIKE statement)


I am trying to filter my PySpark data frame in the following way: I have one column which contains long_text and one column which contains a number. If the long text contains the number, I want to keep the row. I am trying to use the SQL LIKE statement, but it seems I can't apply it to another column (here number). My code is the following:

from pyspark.sql.functions import regexp_extract, col, concat, lit
from pyspark.sql.types import *
PN_in_NC = (df
        .filter(df.long_text.like(concat(lit("%"), df.number, lit("%")))))

I get the following error: Method like([class org.apache.spark.sql.Column]) does not exist.

I tried multiple things to fix it (such as building the '%number%' string as a column before the filter, dropping lit, and using '%' + number + '%'), but nothing worked. If LIKE can't be applied to another column, is there another way to do this?


Solution

  • You can use the contains function.

    from pyspark.sql.functions import col

    df1 = spark.createDataFrame(
        [("hahaha the 3 is good", 3), ("i dont know about 3", 2),
         ("what is 5 doing?", 5), ("ajajaj 123", 2), ("7 dwarfs", 1)],
        ["long_text", "number"])
    # Keep the rows whose long_text contains the digits of number.
    df1.filter(col("long_text").contains(col("number"))).show()
    

    This keeps only the rows where the value in the number column appears as a substring of long_text.

    Output:

    +--------------------+------+
    |           long_text|number|
    +--------------------+------+
    |hahaha the 3 is good|     3|
    |    what is 5 doing?|     5|
    |          ajajaj 123|     2|
    +--------------------+------+
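  • Note that contains does a plain substring match: in the output above, "ajajaj 123" is kept for number 2 because "2" occurs inside "123". If you specifically want the SQL LIKE form from the question, one option is to build the pattern inside a SQL expression with expr. This is a minimal sketch against the same df1; it relies on Spark accepting a non-literal LIKE pattern (the pattern is then compiled per row), whereas Column.like() itself only accepts a literal Python string, which is why the original attempt failed:

    from pyspark.sql.functions import expr

    # Build the '%number%' pattern per row in SQL; concat implicitly
    # casts the numeric column to a string before concatenation.
    df1.filter(expr("long_text LIKE concat('%', number, '%')")).show()

    On the sample data this should return the same three rows as the contains version.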