Tags: regex, scala, apache-spark, tokenize, apache-spark-ml

StopWords() not working after using RegexTokenizer() in Spark/Scala ML


I need to use StopWordsRemover after I use RegexTokenizer. However, I noticed that no stop words are being removed. When I use Tokenizer (as opposed to RegexTokenizer), stop words are removed, because Tokenizer splits the input into an array of individual terms. RegexTokenizer, with my pattern, outputs an array whose elements are whole phrases rather than separate words, so nothing matches the stop-word list. Is there a fix for this?

Here is what my data looks like, where "body" is the initial data. You can see that the "removedStopWords" column is identical to the "removeTags" column, which should not be the case:

[Screenshot: DataFrame with identical "removeTags" and "removedStopWords" columns]

Code:

import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover}

val regexTokenizer = new RegexTokenizer() // first remove tags from the string
  .setInputCol("body")
  .setOutputCol("removeTags")
  .setPattern("<[^>]+>") // matches HTML tags only, so the text is never split on whitespace
val stopWordsRemover = new StopWordsRemover()
  .setInputCol(regexTokenizer.getOutputCol)
  .setOutputCol("removedStopWords")

Solution

  • A tokenizer should take a stream of characters (e.g. a sentence) and break it up into smaller chunks (e.g. words). For example, Spark's Tokenizer splits a sentence on whitespace.

    Here, you use the RegexTokenizer to remove HTML tags (more accurately, to split the sentence into tokens wherever a tag occurs). While this removes the tags, you also need to make sure the output is split into individual words. To do that, extend the regex so that, in addition to tags, it splits on any whitespace by appending \\s+ to the pattern:

    val regexTokenizer = new RegexTokenizer() // removes tags from the string and splits it into words
      .setInputCol("body")
      .setOutputCol("removeTags")
      .setPattern("<[^>]+>|\\s+") // split on tags or runs of whitespace


    Now using StopWordsRemover should work as expected; the sketch below demonstrates the full pipeline.
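
    As a sanity check, here is a minimal end-to-end sketch (the local SparkSession setup and the sample text are illustrative, not from the original post):

    import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq("<p>This is the body of a post</p>").toDF("body")

    val regexTokenizer = new RegexTokenizer()
      .setInputCol("body")
      .setOutputCol("removeTags")
      .setPattern("<[^>]+>|\\s+") // split on tags and on whitespace

    val stopWordsRemover = new StopWordsRemover()
      .setInputCol(regexTokenizer.getOutputCol)
      .setOutputCol("removedStopWords")

    val tokenized = regexTokenizer.transform(df)
    stopWordsRemover.transform(tokenized).show(false)
    // removeTags:       [this, is, the, body, of, a, post]
    // removedStopWords: [body, post]

    Because the tokens are now individual lowercased words, they match entries in StopWordsRemover's default English stop-word list and are removed.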