
Struggling with removing stop words using nltk


I'm trying to remove the stop words from "I don't like ice cream." I have defined:

stop_words = set(nltk.corpus.stopwords.words('english'))

and the function

def stop_word_remover(text):
    return [word for word in text if word.lower() not in stop_words]

But when I apply the function to the string in question, I get this list:

[' ', 'n', '’', ' ', 'l', 'k', 'e', ' ', 'c', 'e', ' ', 'c', 'r', 'e', '.']

Joining the strings back together with ''.join(stop_word_remover("I don’t like ice cream.")) gives

' n’ lke ce cre.'

which is not what I was expecting.

Any tips on where I have gone wrong?


Solution

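  • The comprehension word for word in text iterates over the characters of text, not over its words. Single letters such as 'i', 'd', 'o', 't', 'a' and 'm' are themselves in NLTK's English stop word list, so those characters were dropped. Tokenize the text into words first, as below: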

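    import nltk
    from nltk.tokenize import word_tokenize

    # One-time downloads of the stop word list and the tokenizer model.
    # Newer NLTK releases may also need nltk.download('punkt_tab').
    nltk.download('stopwords')
    nltk.download('punkt')

    stop_words = set(nltk.corpus.stopwords.words('english'))

    def stop_word_remover(text):
        # Split the text into word tokens before filtering.
        word_tokens = word_tokenize(text)
        word_list = [word for word in word_tokens if word.lower() not in stop_words]
        return " ".join(word_list)

    stop_word_remover("I don't like ice cream.")

    # word_tokenize splits "don't" into "do" and "n't"; only "do" is a stop word.
    ## 'n't like ice cream .'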
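
To see the character-versus-word difference without installing anything, here is a minimal sketch that uses plain str.split() instead of word_tokenize. The STOP_WORDS set and the remove_stop_words name are made up for illustration; the set is a tiny hand-picked stand-in, not NLTK's actual list.

```python
# Hypothetical, hand-picked stop words (an assumption, not NLTK's list).
STOP_WORDS = {"i", "don't", "do", "not", "the", "a"}

text = "I don't like ice cream."

# Iterating over a string yields single characters, not words.
chars = [ch for ch in text]

# Splitting on whitespace first yields whole words.
words = text.split()


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return text with whitespace-separated stop words dropped."""
    kept = [w for w in text.split() if w.lower() not in stop_words]
    return " ".join(kept)


print(chars[:5])                # ['I', ' ', 'd', 'o', 'n']
print(words)                    # ['I', "don't", 'like', 'ice', 'cream.']
print(remove_stop_words(text))  # like ice cream.
```

Note that split() leaves punctuation attached ('cream.'), so a stop word followed by a period or comma would slip through. That is one reason to prefer a real tokenizer such as word_tokenize for anything beyond a quick check.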