I need to use StopWordsRemover after RegexTokenizer. However, I noticed that no stop words are being removed. When I use Tokenizer (as opposed to RegexTokenizer), stop words are removed, because Tokenizer breaks the output into an array of individual terms. RegexTokenizer outputs an array whose entries are still whole strings rather than separate words. Is there a fix for this?
Here is what my data looks like, where "body" is the initial data. You can see that "removedStopWords" is the same as the "removeTags" column, which should not be the case:
Code:
import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover}

val regexTokenizer = new RegexTokenizer() // first remove tags from the string
  .setInputCol("body")
  .setOutputCol("removeTags")
  .setPattern("<[^>]+>")

val stopWordsRemover = new StopWordsRemover()
  .setInputCol(regexTokenizer.getOutputCol)
  .setOutputCol("removedStopWords")
A tokenizer should take a stream of characters (e.g. a sentence) and break it up into smaller chunks (e.g. words). For example, the Tokenizer in Spark splits a sentence on whitespace.
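As a quick illustration (the column names here are arbitrary):
import org.apache.spark.ml.feature.Tokenizer

val tokenizer = new Tokenizer()
  .setInputCol("body")
  .setOutputCol("words")

// "This is a test" becomes ["this", "is", "a", "test"]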
Here, you use the RegexTokenizer to remove HTML tags (more accurately, to split the sentence into tokens based on the tags). While this works, you need to make sure the output is also split into individual words. To do that, extend the regex so that, in addition to tags, it also splits on any run of whitespace, by adding \\s+ as an alternative in the pattern:
val regexTokenizer = new RegexTokenizer() // removes tags and splits into words
  .setInputCol("body")
  .setOutputCol("removeTags")
  .setPattern("<[^>]+>|\\s+")
Now using StopWordsRemover should work as expected.
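For example, with a sample sentence (assuming a SparkSession named spark and StopWordsRemover's default English stop word list):
import spark.implicits._

val df = Seq("<p>this is a test sentence</p>").toDF("body")

val cleaned = stopWordsRemover.transform(regexTokenizer.transform(df))
cleaned.select("removeTags", "removedStopWords").show(truncate = false)
// removeTags:       [this, is, a, test, sentence]
// removedStopWords: [test, sentence]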