
Modify Stopword-Removal-Code to remove numbers as well


I have tokenized text in a DataFrame column. The code that removes stopwords from it works, but I would also like to remove punctuation, numbers, and special characters without listing each one. In particular, I want to be sure it also removes larger numbers that are tokenized as a single token.

My current code is:

eng_stopwords = stopwords.words('english')
punctuation = ['.', ',', ';', ':', '!']  # and so on
complete_stopwords = punctuation + eng_stopwords
df['removed'] = df['tokenized_text'].apply(lambda words: [word for word in words if word not in complete_stopwords])

Solution

  • You can get the punctuation characters from the string module:

    import string
    print(string.punctuation)
    # !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
    
    from nltk.corpus import stopwords

    eng_stopwords = stopwords.words('english')
    
    punctuation = list(string.punctuation) 
    
    complete_stopwords = punctuation + eng_stopwords
    
    df['removed'] = df['tokenized_text'].apply(lambda words: [word for word in words if word not in complete_stopwords])
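
The list above only matches tokens that are exactly one punctuation character, so numbers such as '2023' and multi-character tokens such as '...' or '1,000' would still get through. One possible way to cover those as well is to keep a token only if it contains at least one letter. The following is a minimal sketch, not a definitive implementation: the helper names keep_token and remove_noise are hypothetical, and the small eng_stopwords set is a stand-in for stopwords.words('english') so that the example runs without NLTK.

```python
# Hypothetical stand-in so the sketch runs without NLTK;
# in the question this would be set(stopwords.words('english')).
eng_stopwords = {"the", "is", "a", "and", "of", "in"}


def keep_token(word, stop_set=eng_stopwords):
    """Return True if the token is not a stopword and contains a letter."""
    if word in stop_set:
        return False
    # Tokens without any letter are dropped: '2023', '1,000', '...', '$'
    return any(ch.isalpha() for ch in word)


def remove_noise(words, stop_set=eng_stopwords):
    """Filter one tokenized text (a list of strings)."""
    return [word for word in words if keep_token(word, stop_set)]


tokens = ["the", "price", "is", "1,000", "$", "in", "2023", "...", "covid19", "!"]
cleaned = remove_noise(tokens)
print(cleaned)
```

With the DataFrame from the question, this could be applied as df['removed'] = df['tokenized_text'].apply(remove_noise). Mixed tokens such as 'covid19' are kept because they contain letters; if those should go too, replacing the any(...) check with word.isalpha() would be a stricter alternative. Using a set for the stopwords also makes each membership test faster than a list lookup.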