Tags: python, pyspark, apache-spark-sql, text-mining, stop-words

How to add custom stop word list to StopWordsRemover


I am using the pyspark.ml.feature.StopWordsRemover class on my PySpark DataFrame, which has ID and Text columns. In addition to the default stop word list, I would like to add my own custom list so that all numeric values are removed from the strings as well.

I can see that the class provides a setStopWords method, but I'm struggling with the proper syntax for using it.

from pyspark.sql.functions import *
from pyspark.ml.feature import * 

a = StopWordsRemover(inputCol="words", outputCol="filtered")
b = a.transform(df)

The above code gives me the expected results in the filtered column, but it only removes the standard stop words. I'm looking for a way to add my own custom list containing additional words and numeric values that I wish to filter out.


Solution

  • You can pass your own list through the stopWords parameter:

    stopwordList = ["word1", "word2", "word3"]

    StopWordsRemover(inputCol="words", outputCol="filtered", stopWords=stopwordList)
    

    A small note:

    The above solution replaces the default list of stop words with the list you supply.
    If you want to use your own stop words in addition to the predefined ones, extend your list with the default list before passing it to StopWordsRemover() as a parameter. Converting the result to a set removes any duplicates.

    stopwordList = ["word1", "word2", "word3"]
    # Append the default stop words to the custom ones
    stopwordList.extend(StopWordsRemover().getStopWords())
    stopwordList = list(set(stopwordList))  # optional: drop duplicates
    StopWordsRemover(inputCol="words", outputCol="filtered", stopWords=stopwordList)
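On the numeric values mentioned in the question: a stop word list only matches exact tokens, so it cannot cover every possible number. One option is to filter numeric tokens in a separate step after StopWordsRemover. The following is a minimal sketch, not part of the original answer. The names NUMERIC_TOKEN, drop_numeric_tokens, add_filtered_column, and the output column filtered_no_numbers are hypothetical, and it assumes "numeric" means digits with an optional decimal part.

```python
import re

# Assumption: a numeric token is digits with an optional decimal part, e.g. "42" or "3.14".
NUMERIC_TOKEN = re.compile(r"\d+(\.\d+)?")


def drop_numeric_tokens(tokens):
    """Return the tokens that are not purely numeric; None is treated as an empty list."""
    if tokens is None:
        return []
    return [t for t in tokens if not NUMERIC_TOKEN.fullmatch(t)]


def add_filtered_column(df, input_col="filtered", output_col="filtered_no_numbers"):
    """Apply drop_numeric_tokens to an array<string> column of a Spark DataFrame."""
    # Imported lazily so the pure-Python helper above can be used without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType

    drop_numeric_udf = udf(drop_numeric_tokens, ArrayType(StringType()))
    return df.withColumn(output_col, drop_numeric_udf(df[input_col]))
```

With the remover from the question, this could be applied as add_filtered_column(a.transform(df)). Tokens that mix letters and digits, such as "abc123", are kept under this definition; adjust the pattern if those should be removed too.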