python pyspark apache-spark-sql text-mining stop-words

How to add custom stop word list to StopWordsRemover

I am using the pyspark.ml.feature.StopWordsRemover class on my PySpark DataFrame, which has an ID column and a Text column. In addition to the default stop word list, I would like to add my own custom list, mainly to remove numeric values from the text.

I can see that the class provides a setStopWords method, but I am struggling with the proper syntax for using it.

from pyspark.ml.feature import StopWordsRemover

a = StopWordsRemover(inputCol="words", outputCol="filtered")
b = a.transform(df)

The code above gives me the expected results in the filtered column, but it only removes the standard stop words. I am looking for a way to supply my own list, containing additional words and numeric values that I want to filter out.

Solution

You can pass your own list through the stopWords parameter of the constructor:

stopwordList = ["word1", "word2", "word3"]

StopWordsRemover(inputCol="words", outputCol="filtered", stopWords=stopwordList)

A small note:

The solution above replaces the default list of stop words with the list you supply.
If you want to use your own stop words in addition to the predefined ones, extend your list with the default list before passing it to StopWordsRemover() as a parameter. Converting the result to a set and back to a list removes any duplicates.

stopwordList = ["word1", "word2", "word3"]
stopwordList.extend(StopWordsRemover().getStopWords())  # append the default stop words
stopwordList = list(set(stopwordList))  # optional: removes duplicates
StopWordsRemover(inputCol="words", outputCol="filtered", stopWords=stopwordList)
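
On the numeric values mentioned in the question: a stop word list only matches tokens you enumerate, so it cannot cover every possible number. One possible workaround is to filter the token array after StopWordsRemover has run. The sketch below is only an illustration under assumptions: the helper names (drop_numeric_tokens, add_numeric_filter), the output column name, and the regular expression that decides what counts as "numeric" are all made up here and are not part of the Spark API.

```python
import re

# Assumption: a token is "numeric" if it consists only of digits, optionally
# grouped by "." or "," (e.g. "42", "3.14", "1,000"). Adjust to your data.
NUMERIC_TOKEN = re.compile(r"\d+(?:[.,]\d+)*")


def drop_numeric_tokens(tokens):
    """Return the tokens that are not purely numeric (None is passed through)."""
    if tokens is None:
        return None
    return [t for t in tokens if not NUMERIC_TOKEN.fullmatch(t)]


def add_numeric_filter(df, input_col="filtered", output_col="filtered_no_numbers"):
    """Apply drop_numeric_tokens to an array<string> column of a Spark DataFrame.

    Hypothetical helper: the column names are placeholders.
    """
    # Imported lazily so the pure-Python helper above works without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType

    drop_udf = udf(drop_numeric_tokens, ArrayType(StringType()))
    return df.withColumn(output_col, drop_udf(df[input_col]))
```

With the remover from the question, this would be used as add_numeric_filter(a.transform(df)). A Python UDF is slower than built-in column functions, so treat this as a starting point rather than a tuned solution.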