This is my DataFrame (df) with two columns:
utid  | description
------+-------------------------------------
12342 | my name is 123 amrud and nitesh
2345  | my name is anil
2122  | my name is 1234 mohan
and a list like List("mohan", "nitesh").
I need to check whether any element of this list is present in the description column. If yes, put "found", otherwise "not found", in a new column of the DataFrame.
The actual list is far bigger than this, around 20k elements.
The output dataframe should be like this:
utid  | description                     | foundornot
------+---------------------------------+-----------
12342 | my name is 123 amrud and nitesh | found
2345  | my name is anil                 | not found
2122  | my name is 1234 mohan           | found
Any help is welcome
You can simply define a udf function that checks the condition and returns one of the "found" or "not found" strings:
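import org.apache.spark.sql.functions._
val list = List("mohan", "nitesh")
// "found" if any list element is a substring of the description;
// a null description is treated as "not found" instead of throwing.
val checkUdf = udf((strCol: String) => if (strCol != null && list.exists(strCol.contains)) "found" else "not found")
df.withColumn("foundornot", checkUdf(col("description"))).show(false)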
That's it, and you should get:
+-----+-------------------------------+----------+
|utid |description                    |foundornot|
+-----+-------------------------------+----------+
|12342|my name is 123 amrud and nitesh|found     |
|2345 |my name is anil                |not found |
|2122 |my name is 1234 mohan          |found     |
+-----+-------------------------------+----------+
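Since the real list has around 20k elements, scanning the whole list for every row may get slow. Here is a rough sketch of one possible variant, not the only way to do it: it broadcasts the list as a Set and looks up each whitespace-separated word of the description. It assumes every list element is a single word and that whole-word matching is acceptable, which is a different rule from the substring check above. The names FoundOrNotSketch, label and checkWordsUdf are made up for illustration, and the local SparkSession is only there to make the example self-contained.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object FoundOrNotSketch {
  // Hypothetical helper: "found" if any whitespace-separated token of the
  // description is in `names`; null descriptions count as "not found".
  def label(description: String, names: Set[String]): String =
    if (description != null && description.split("\\s+").exists(names.contains)) "found"
    else "not found"

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in a real job reuse the existing one.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("foundornot-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (12342, "my name is 123 amrud and nitesh"),
      (2345, "my name is anil"),
      (2122, "my name is 1234 mohan")
    ).toDF("utid", "description")

    // Assumption: list elements are single words, so a Set lookup per token is enough.
    // Broadcasting ships the Set to each executor once instead of with every task.
    val names = spark.sparkContext.broadcast(Set("mohan", "nitesh"))

    val checkWordsUdf = udf((description: String) => label(description, names.value))

    df.withColumn("foundornot", checkWordsUdf(col("description"))).show(false)
    spark.stop()
  }
}
```

For the three sample rows this gives the same labels as the udf above. The difference shows up on partial matches: "mohan" will not match a description containing "mohankumar" here, so keep the contains-based udf if you need substring matching.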
I hope the answer is helpful