I have a Seq
and dataframe. The dataframe contains a column of array type. I am trying to remove the elements that are in the Seq
from the column.
For example:
val stop_words = Seq("a", "and", "for", "in", "of", "on", "the", "with", "s", "t")
+---------------------------------------------------+
|sorted_items |
+---------------------------------------------------+
|[flannel, and, for, s, shirts, sleeve, warm] |
|[3, 5, kitchenaid, s] |
|[5, 6, case, flip, inch, iphone, on, xs] |
|[almonds, chocolate, covered, dark, joe, s, the] |
|null |
|[] |
|[animation, book] |
Expected output:
+---------------------------------------------------+
|sorted_items |
+---------------------------------------------------+
|[flannel, shirts, sleeve, warm] |
|[3, 5, kitchenaid] |
|[5, 6, case, flip, inch, iphone, xs] |
|[almonds, chocolate, covered, dark, joe, the] |
|null |
|[] |
|[animation, book] |
How can this be done in an effective and optimized way?
Use the StopWordsRemover
from the MLlib package. It is possible to set custom stop words using the setStopWords
function. StopWordsRemover
will not handle null values so those will need to be dealt with before usage. It can be done as follows:
val df2 = df.withColumn("sorted_values", coalesce($"sorted_values", array()))
val remover = new StopWordsRemover()
.setStopWords(stop_words.toArray)
.setInputCol("sorted_values")
.setOutputCol("filtered")
val df3 = remover.transform(df2)