Search a dataframe from a list and add column to say found or not


This is my df with 2 columns:

utid  | description
------+-------------------------------------
12342 | my name is 123 amrud and nitesh
 2345 | my name is anil
 2122 | my name is 1234 mohan

and a list like {"mohan", "nitesh"}.

I need to search if an element from this list is present in the description column. If yes, then print "found" else print "not found" in a different column of the dataframe.

The actual list is far bigger, around 20k elements.

The output dataframe should be like this:

utid  | description                     | foundornot
------+---------------------------------+-----------
12342 | my name is 123 amrud and nitesh | found
 2345 | my name is anil                 | not found
 2122 | my name is 1234 mohan           | found

Any help is welcome


Solution

  • You can define a UDF that checks the condition and returns either "found" or "not found":

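    import org.apache.spark.sql.functions.{col, udf}

    // Terms to look for; in practice this would hold the full ~20k-element list.
    val list = List("mohan", "nitesh")

    // Returns "found" if any list element occurs as a substring of the description.
    // A null description is treated as "not found" instead of throwing a NullPointerException.
    val checkUdf = udf((strCol: String) =>
      if (strCol != null && list.exists(strCol.contains)) "found" else "not found")

    df.withColumn("foundornot", checkUdf(col("description"))).show(false)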
    

    That's it. You should get:

    +-----+-------------------------------+----------+
    |utid |description                    |foundornot|
    +-----+-------------------------------+----------+
    |12342|my name is 123 amrud and nitesh|found     |
    |2345 |my name is anil                |not found |
    |2122 |my name is 1234 mohan          |found     |
    +-----+-------------------------------+----------+
    

    I hope the answer is helpful
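
With a list of around 20k elements, checking every element with `contains` for each row may become slow. If whole-word matching is acceptable, one option is to look up each word of the description in a set. This is an assumption and differs from the substring match above: "mohan" would no longer match "mohanlal", and multi-word list elements would not match at all. Below is a minimal sketch in plain Scala. `FoundOrNot`, `buildLookup` and `label` are hypothetical names, and the case-insensitive comparison is also an assumption.

```scala
// Hypothetical helper, not part of the original answer.
object FoundOrNot {
  // Build the lookup set once, up front.
  // Lower-casing is an assumption: it makes the match case-insensitive.
  def buildLookup(names: Seq[String]): Set[String] =
    names.map(_.toLowerCase).toSet

  // Split the description on anything that is not a letter or digit,
  // then test each word against the set (whole-word match, not substring).
  def label(description: String, lookup: Set[String]): String =
    if (description != null &&
        description.toLowerCase.split("[^\\p{L}\\p{N}]+").exists(lookup.contains)) "found"
    else "not found"
}
```

Assuming the same `df` as above, it could be plugged in as `val lookup = FoundOrNot.buildLookup(list)` followed by `df.withColumn("foundornot", udf((s: String) => FoundOrNot.label(s, lookup)).apply(col("description")))`. The cost per row then depends on the number of words in the description rather than on the size of the list.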