How to remove elements from an array Column in Spark?

I have a Seq and dataframe. The dataframe contains a column of array type. I am trying to remove the elements that are in the Seq from the column.

For example:

val stop_words = Seq("a", "and", "for", "in", "of", "on", "the", "with", "s", "t")

    +---------------------------------------------------+
    |sorted_items                                       |
    +---------------------------------------------------+
    |[flannel, and, for, s, shirts, sleeve, warm]       |
    |[3, 5, kitchenaid, s]                              |
    |[5, 6, case, flip, inch, iphone, on, xs]           |
    |[almonds, chocolate, covered, dark, joe, s, the]   |
    |null                                               |
    |[]                                                 |
    |[animation, book]                                  |

Expected output:

+---------------------------------------------------+
|sorted_items                                       |
+---------------------------------------------------+
|[flannel, shirts, sleeve, warm]                    |
|[3, 5, kitchenaid]                                 |
|[5, 6, case, flip, inch, iphone, xs]               |
|[almonds, chocolate, covered, dark, joe, the]      |
|null                                               |
|[]                                                 |
|[animation, book]                                  |

How can this be done in an effective and optimized way?

Solution

Use the StopWordsRemover from the MLlib package. It is possible to set custom stop words using the setStopWords function. StopWordsRemover will not handle null values so those will need to be dealt with before usage. It can be done as follows:

val df2 = df.withColumn("sorted_values", coalesce($"sorted_values", array()))

val remover = new StopWordsRemover()
  .setStopWords(stop_words.toArray)
  .setInputCol("sorted_values")
  .setOutputCol("filtered")

val df3 = remover.transform(df2)