Search code examples
arraysscalaapache-sparkdataframeseq

How to remove elements from an array Column in Spark?


I have a Seq and dataframe. The dataframe contains a column of array type. I am trying to remove the elements that are in the Seq from the column.

For example:

val stop_words = Seq("a", "and", "for", "in", "of", "on", "the", "with", "s", "t")

    +---------------------------------------------------+
    |sorted_items                                       |
    +---------------------------------------------------+
    |[flannel, and, for, s, shirts, sleeve, warm]       |
    |[3, 5, kitchenaid, s]                              |
    |[5, 6, case, flip, inch, iphone, on, xs]           |
    |[almonds, chocolate, covered, dark, joe, s, the]   |
    |null                                               |
    |[]                                                 |
    |[animation, book]                                  |

Expected output:

+---------------------------------------------------+
|sorted_items                                       |
+---------------------------------------------------+
|[flannel, shirts, sleeve, warm]                    |
|[3, 5, kitchenaid]                                 |
|[5, 6, case, flip, inch, iphone, xs]               |
|[almonds, chocolate, covered, dark, joe, the]      |
|null                                               |
|[]                                                 |
|[animation, book]                                  |

How can this be done in an effective and optimized way?


Solution

  • Use the StopWordsRemover from the MLlib package. It is possible to set custom stop words using the setStopWords function. StopWordsRemover will not handle null values so those will need to be dealt with before usage. It can be done as follows:

    val df2 = df.withColumn("sorted_values", coalesce($"sorted_values", array()))
    
    val remover = new StopWordsRemover()
      .setStopWords(stop_words.toArray)
      .setInputCol("sorted_values")
      .setOutputCol("filtered")
    
    val df3 = remover.transform(df2)