I am trying to filter my pyspark data frame the following way: I have one column which contains long_text
and one column which contains numbers. If the long text contains the number
I want to keep the column.
I am trying to use the SQL LIKE
statement, but it seems I can't apply it to another column (here number
)
My code is the following:
from pyspark.sql.functions import regexp_extract, col, concat, lit
from pyspark.sql.types import *
PN_in_NC = (df
.filter(df.long_text.like(concat(lit("%"), df.number, lit("%"))))))
I get the following error:
Method like([class org.apache.spark.sql.Column]) does not exist
.
I tried multiple things to fix it (such as creating the '%number%'
string as column before the filter, not using lit
, using '%' + number + '%'
) but nothing worked. If LIKE
can't be applied to another column, is there another way to do this?
You can use the contains
function.
from pyspark.sql.functions import *
df1 = spark.createDataFrame([("hahaha the 3 is good",3),("i dont know about 3",2),("what is 5 doing?",5),\
("ajajaj 123",2),("7 dwarfs",1)], ["long_text","number"])
df1.filter(col("long_text").contains(col("number"))).show()
The long_text column should contain the number in the number column.
Output:
+--------------------+------+
| long_text|number|
+--------------------+------+
|hahaha the 3 is good| 3|
| what is 5 doing?| 5|
| ajajaj 123| 2|
+--------------------+------+