I have tokenized text in a df column. My code for removing stopwords works, but I would also like to remove punctuation, numbers and special characters without spelling each one out. In particular, I want to be sure it also removes larger numbers that are tokenized as a single token.
My current code is:
from nltk.corpus import stopwords

eng_stopwords = stopwords.words('english')
punctuation = ['.', ',', ';', ':', '!']  # and so on
complete_stopwords = punctuation + eng_stopwords
df['removed'] = df['tokenized_text'].apply(lambda words: [word for word in words if word not in complete_stopwords])
You can get the punctuation characters from the string module:
import string
print(string.punctuation)
# !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
eng_stopwords = stopwords.words('english')
# A set gives fast membership tests inside the list comprehension
complete_stopwords = set(string.punctuation) | set(eng_stopwords)
df['removed'] = df['tokenized_text'].apply(lambda words: [word for word in words if word not in complete_stopwords])
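Note that string.punctuation only contains single characters, so multi-character tokens such as '...' and number tokens such as '12345' would still get through the filter above. One possible way to handle those, as a minimal sketch rather than the definitive fix, is to keep only tokens for which str.isalpha() is true. The small stopword set below is a stand-in for stopwords.words('english') so the example runs without NLTK, clean_tokens is a hypothetical helper name, and lowercasing before the stopword check is my own assumption:

```python
# Stand-in for stopwords.words('english') so the sketch runs without NLTK data.
eng_stopwords = {"the", "a", "is", "and", "in"}


def clean_tokens(words):
    """Drop stopwords plus any token that is not purely alphabetic.

    str.isalpha() is False for numbers ("12345"), punctuation ("!", "...")
    and mixed tokens ("e-mail", "don't"), so all of those are removed.
    """
    return [
        word for word in words
        if word.isalpha() and word.lower() not in eng_stopwords
    ]


tokens = ["The", "price", "is", "12345", "...", "in", "2020", "!"]
print(clean_tokens(tokens))  # ['price']
```

With a DataFrame this would be applied as df['removed'] = df['tokenized_text'].apply(clean_tokens). Be aware that isalpha() also throws away tokens like "don't" or "e-mail"; if you want to keep those, a looser test such as any(ch.isalpha() for ch in word) only drops tokens that contain no letters at all.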