I am using the pyspark.ml.feature.StopWordsRemover class on my PySpark DataFrame, which has an ID column and a Text column. In addition to the default stop word list, I would like to supply my own custom list so that all numeric values are removed from the text as well.
I can see that the class provides a setStopWords method, but I'm struggling with the proper syntax for using it.
from pyspark.ml.feature import StopWordsRemover
# "words" must be an array-of-strings column (e.g. the output of a Tokenizer)
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
result = remover.transform(df)
The above code gives me the expected results in the filtered column, but it only removes the standard stop words. I'm looking for a way to add my own custom list, containing more words and the numeric values that I wish to filter out.
You can pass your own list through the stopWords parameter:
stopwordList = ["word1", "word2", "word3"]
StopWordsRemover(inputCol="words", outputCol="filtered", stopWords=stopwordList)
The above solution replaces the original list of stop words with the list we supplied.
If you want to add your own stop words in addition to the predefined ones, extend your list with the default list before passing it to StopWordsRemover() as the stopWords parameter. Converting the result to a set and back to a list removes any duplicates.
stopwordList = ["word1", "word2", "word3"]
# Append the default stop words to the custom ones
stopwordList.extend(StopWordsRemover().getStopWords())
stopwordList = list(set(stopwordList))  # optional: drops duplicates, does not preserve order
StopWordsRemover(inputCol="words", outputCol="filtered", stopWords=stopwordList)
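One caveat about the numeric part of the question: StopWordsRemover compares whole tokens against a fixed list, so it can only drop the numbers you enumerate in advance. A pattern-based filter is one possible way to catch all of them. The sketch below is only an illustration in plain Python; the name drop_numeric_tokens and the regular expression (optional sign, digits, optional decimal or thousands groups) are my own assumptions about what counts as "numeric", not something provided by Spark.

```python
import re

# Assumed definition of "numeric": optional sign, digits, and optional
# decimal/thousands groups, e.g. "42", "-3.5", "1,000".
NUMERIC_TOKEN = re.compile(r"[+-]?\d+(?:[.,]\d+)*")


def drop_numeric_tokens(tokens):
    """Return the tokens that are not purely numeric (hypothetical helper)."""
    if tokens is None:  # tolerate null arrays coming from a DataFrame
        return None
    return [t for t in tokens if not NUMERIC_TOKEN.fullmatch(t)]


sample = ["order", "42", "shipped", "3.5", "kg", "1,000", "b2b"]
cleaned = drop_numeric_tokens(sample)
print(cleaned)
```

To apply this to the filtered column, one option would be to wrap the function with pyspark.sql.functions.udf using an ArrayType(StringType()) return type and call it through withColumn. Mixed tokens such as "b2b" are kept on purpose, since only fully numeric tokens match the pattern.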