python pandas stop-words

Stopwords on a DataFrame Column


I'm cleaning an Excel file so I can present it in Power BI. I want to remove the stopwords from a specific column. This is the code I'm using, but it runs into a problem. The stopwords I need to remove are in Spanish.

I'm also stripping the . and , characters so I can split the column on spaces and analyze the information. If you know an easier way, please let me know.

import nltk
from nltk.corpus import stopwords
stop = stopwords.words('spanish')
df['Producto'] = df['Producto'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

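# regex=False treats "," and "." as literal characters, not regex patterns
df["Producto"] = df["Producto"].str.replace(",", "", regex=False)
df["Producto"] = df["Producto"].str.replace(".", "", regex=False)

df = df["Producto"].str.split(" ", expand=True)
print(df)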

Solution

  • Here is a fast way to do it. I recreated a dataframe with some sample data:

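    import re
    import nltk  # run nltk.download('stopwords') once if the corpus is missing
    import pandas as pd
    from nltk.corpus import stopwords

    # One compiled regex that matches any Spanish stopword as a whole word,
    # plus any whitespace that follows it. Matching is case-sensitive.
    pattern = re.compile(r'\b(' + r'|'.join(stopwords.words('spanish')) + r')\b\s*')
    df_temp = pd.DataFrame({'Words': ["Uno", "Dos", "Tres", "Other", "los"]})
    df_temp['Words'] = df_temp['Words'].map(lambda x: pattern.sub('', str(x)))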
    

    Output of df_temp:

       Words
    0    Uno
    1    Dos
    2   Tres
    3  Other
    4
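
Note that the regex above is case-sensitive, so a capitalized "Los" would survive, and it does not touch the . and , characters from the question. As a rough sketch of one way to handle both in a single pass over the `Producto` column: the `STOP` set below is a small hand-picked stand-in for `stopwords.words('spanish')` so the snippet runs without the NLTK corpus, and the sample rows and the `clean` helper are hypothetical names made up for illustration, not anything from the original post.

```python
import re
import pandas as pd

# Assumption: a tiny stand-in for stopwords.words('spanish').
# In real use, replace it with: STOP = set(stopwords.words('spanish'))
STOP = {"de", "la", "el", "los", "las", "con", "y", "para"}

def clean(text):
    """Replace '.' and ',' with spaces, then drop stopwords ignoring case."""
    text = re.sub(r"[.,]", " ", str(text))
    # split() with no argument also collapses repeated whitespace
    return " ".join(w for w in text.split() if w.lower() not in STOP)

# Hypothetical sample data standing in for the Excel column
df = pd.DataFrame({"Producto": ["Leche de vaca, entera.",
                                "Pan con queso y jamón",
                                "Los tomates"]})

df["Producto"] = df["Producto"].map(clean)

# One token per column; shorter rows are padded with missing values
tokens = df["Producto"].str.split(expand=True)
print(tokens)
```

Stripping the punctuation before filtering matters here: a token like "de," would otherwise not match the stopword "de". Keeping the split result in a separate variable (`tokens`) instead of overwriting `df` also leaves the cleaned column available for Power BI.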