Tags: python, apache-spark, pyspark, apache-spark-ml, stop-words

Remove specific stopwords in PySpark


I'm new to PySpark and would like to remove some French stopwords from a PySpark column. Due to some constraints, I can't use NLTK or spaCy; StopWordsRemover is the only option I have.

Below is what I have tried so far, without success:

from pyspark.ml import *
from pyspark.ml.feature import *

stop = ['EARL ', 'EIRL ', 'EURL ', 'SARL ', 'SA ', 'SAS ', 'SASU ', 'SCI ', 'SCM ', 'SCP ']
stop = [l.lower() for l in stop]
    
model = Pipeline(stages=[
    Tokenizer(inputCol="name", outputCol="token"),
    StopWordsRemover(inputCol="token", outputCol="stop", stopWords=stop),
]).fit(df)

result = model.transform(df)

Here is the expected output:

|name          |stop          |
|2A            |2A            |
|AZEJADE       |AZEJADE       |
|MONAZTESANTOS |MONAZTESANTOS |
|SCI SANTOS    |SANTOS        |
|SA FCB        |FCB           |

Solution

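  • The problem is the trailing spaces in your stop words: the tokens are produced by splitting on whitespace, so a token such as "sci" never equals the stop word "sci ". You also don't need to lowercase the stop words unless you want StopWordsRemover to be case sensitive. Matching is case insensitive by default (caseSensitive is false); you can change that with the caseSensitive parameter.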

    Note that Tokenizer converts its output to lowercase. If you need the result to keep the same case as the input column name, it is simpler to split the column on whitespace instead.

    Try with this:

    from pyspark.ml.feature import StopWordsRemover
    import pyspark.sql.functions as F
    
    # Stop words without trailing spaces
    stop = ['EARL', 'EIRL', 'EURL', 'SARL', 'SA', 'SAS', 'SASU', 'SCI', 'SCM', 'SCP']
    # Assumes an active SparkSession named spark
    df = spark.createDataFrame([("2A",), ("AZEJADE",), ("MONAZTESANTOS",), ("SCI SANTOS",), ("SA FCB",)], ["name"])
    
    # Split on whitespace instead of using Tokenizer, so the original case is kept
    df = df.withColumn("tokens", F.split("name", "\\s+"))
    remover = StopWordsRemover(stopWords=stop, inputCol="tokens", outputCol="stop")
    
    # Join the remaining tokens back into a single string
    result = remover.transform(df).select("name", F.array_join("stop", " ").alias("stop"))
    
    result.show()
    #+-------------+-------------+
    #|         name|         stop|
    #+-------------+-------------+
    #|           2A|           2A|
    #|      AZEJADE|      AZEJADE|
    #|MONAZTESANTOS|MONAZTESANTOS|
    #|   SCI SANTOS|       SANTOS|
    #|       SA FCB|          FCB|
    #+-------------+-------------+
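
To see why the trailing spaces matter without starting a Spark session, here is a minimal plain-Python sketch of the split-then-filter logic described above. It is only a model of the behaviour, not Spark's implementation: the function remove_stop_words and its case_sensitive flag are hypothetical names chosen to mirror StopWordsRemover and its caseSensitive parameter.

```python
def remove_stop_words(text, stop_words, case_sensitive=False):
    """Plain-Python model of split-then-filter (hypothetical helper, not Spark code)."""
    # Splitting on whitespace means no token can ever contain a space.
    tokens = text.split()
    if case_sensitive:
        stop_set = set(stop_words)
        kept = [t for t in tokens if t not in stop_set]
    else:
        # Case-insensitive matching: compare lowercased tokens and stop words.
        stop_set = {w.lower() for w in stop_words}
        kept = [t for t in tokens if t.lower() not in stop_set]
    return " ".join(kept)


padded = ["SCI ", "SA "]             # stop words with trailing spaces, as in the question
clean = [w.strip() for w in padded]  # the same stop words with the spaces removed

print(remove_stop_words("SCI SANTOS", padded))                  # SCI SANTOS (nothing matches)
print(remove_stop_words("SCI SANTOS", clean))                   # SANTOS
print(remove_stop_words("sa FCB", clean))                       # FCB (case ignored by default)
print(remove_stop_words("sa FCB", clean, case_sensitive=True))  # sa FCB
```

If the stop word list comes from a file or another source you don't control, stripping each entry as shown for clean is a cheap safeguard before passing the list to StopWordsRemover.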