Remove list elements in a dataframe in scala


I am new to Scala and struggling with this use case. How can I remove the elements of a list from a column in a dataframe?

I have a list of names, and I need to remove each name wherever it appears in the dataframe.

I have a dataframe like this:

utid|description
12342|my name is daniel
2345|my name is harry and i love sports
2122|his wife sofia is my schoolmate

and a list

list { "harry", "daniel" }

The output should be like

utid|description
12342|my name is 
2345|my name is  and i love sports
2122|his wife sofia is my schoolmate

Solution

  • The simplest way is to use the built-in regexp_replace function:

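    import org.apache.spark.sql.functions._

    val list = List("harry", "daniel")

    // list.mkString builds the regex "(harry)|(daniel)"; every match is replaced with an empty string.
    df.withColumn("description", regexp_replace(col("description"), list.mkString("(", ")|(", ")"), "")).show(false)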
    

    which should give you

    +-----+-------------------------------+
    |utid |description                    |
    +-----+-------------------------------+
    |12342|my name is                     |
    |2345 |my name is  and i love sports  |
    |2122 |his wife sofia is my schoolmate|
    +-----+-------------------------------+
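
The pattern above matches the names anywhere, including inside longer words, and treats any regex metacharacter in a name as part of the pattern. If that matters for your data, a possible variant is sketched below. It is only a sketch: the `NamePattern` object and its `build` helper are hypothetical names introduced here, and the pattern is demonstrated on plain strings so it can be run without a Spark session. Since `regexp_replace` takes a Java regular expression, the resulting string should be usable in place of `list.mkString("(", ")|(", ")")`.

```scala
import java.util.regex.Pattern

// Hypothetical helper (not part of the original answer).
object NamePattern {
  // Build one alternation that matches any of the names as a whole word.
  // Pattern.quote escapes regex metacharacters that a name might contain.
  def build(names: Seq[String]): String =
    names.map(n => Pattern.quote(n)).mkString("\\b(?:", "|", ")\\b")

  def main(args: Array[String]): Unit = {
    val pattern = build(List("harry", "daniel"))
    val rows = List(
      "my name is daniel",
      "my name is harry and i love sports",
      "his wife sofia is my schoolmate"
    )
    // Same replacement that regexp_replace would apply to each row.
    rows.map(_.replaceAll(pattern, "")).foreach(println)
  }
}
```

With the dataframe from the question, the call would then look like `regexp_replace(col("description"), NamePattern.build(list), "")`. As in the original approach, the spaces around a removed name are left in place.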
    

    I hope the answer is helpful