I'm cleaning an Excel file so I can present it in Power BI. I want to remove the stopwords from a specific column. This is the code I'm using, but it runs into a problem. The stopwords I need to remove are in Spanish.
I'm also replacing the . and , characters with spaces so I can split the column and analyze the information. If you know an easier way, please let me know.
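import nltk
from nltk.corpus import stopwords

# df is the DataFrame loaded from the Excel file
stop = stopwords.words('spanish')
# Replace punctuation with spaces first, so a word like "de," still matches the stopword "de"
df["Producto"] = df["Producto"].str.replace(",", " ", regex=False)
df["Producto"] = df["Producto"].str.replace(".", " ", regex=False)
df["Producto"] = df["Producto"].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
# split() with no separator also collapses repeated spaces
df = df["Producto"].str.split(expand=True)
print(df)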
Here is a fast way to do it. I recreated a dataframe with some sample data:
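import re
import nltk
import pandas as pd
from nltk.corpus import stopwords

# One regex that matches any Spanish stopword as a whole word, plus trailing whitespace
pattern = re.compile(r'\b(' + r'|'.join(map(re.escape, stopwords.words('spanish'))) + r')\b\s*')
df_temp = pd.DataFrame({'Words': ["Uno", "Dos", "Tres", "Other", "los"]})
# The match is case-sensitive: lowercase "los" is removed, capitalized words are left alone
df_temp['Words'] = df_temp['Words'].map(lambda x: pattern.sub('', str(x)))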
Output of df_temp:
   Words
0    Uno
1    Dos
2   Tres
3  Other
4
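On the second part of the question (turning . and , into spaces and then splitting), the cleaning can probably be folded into one small function. This is only a minimal sketch: `clean_product` is a made-up helper name, and `STOPWORDS` is a tiny hand-picked subset used so the example runs without downloading the NLTK corpus. In real use you would likely pass `set(stopwords.words('spanish'))` instead. Unlike the regex above, this version compares in lowercase, so capitalized stopwords are dropped too.

```python
import re

# Assumption: small illustrative subset, not the full NLTK Spanish list.
STOPWORDS = {"de", "la", "el", "los", "las", "con", "y", "en"}


def clean_product(text, stopwords=STOPWORDS):
    """Return the words of `text` without punctuation or stopwords."""
    # Turn '.' and ',' into spaces so "vaca,pan" splits into two words.
    text = re.sub(r"[.,]", " ", str(text))
    # split() with no argument collapses runs of whitespace.
    return [word for word in text.split() if word.lower() not in stopwords]


print(clean_product("Leche de vaca, pan. Los huevos"))
# ['Leche', 'vaca', 'pan', 'huevos']
```

Applied to the column from the question, something like `df["Producto"].map(clean_product)` should give one list of words per row, which you can then expand into columns if Power BI needs them separately.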