Unable to implement nltk.stopwords


I am trying to remove stopwords from my data with nltk, but after several attempts the stopwords are still there. The tokenization part of my code works, but I can't work out why the stopword removal does not.

import re
import nltk
from nltk.corpus import stopwords

def pre_process(text):
    
    # remove special characters and digits
    text=re.sub("(\\d|\\W|_)+"," ",text)
    # split the cleaned string into tokens
    text=re.split(r"\W+",text)
    
    return text
text = dat['text'].apply(lambda x:pre_process(x))
nltk.download('stopwords')

def remove_stopwords(text):
    for word in text:
        if word in stopwords.words('english'):
            text.remove(word)
        return text

text_stopword = text.apply(lambda x:remove_stopwords(x))

The code should remove words such as 'the', but after running my CSV through it, words such as 'the' are still present.

Current results:

text returns:

[tv, future, in, the, hands, of, viewers, with...

text_stopword returns:

[tv, future, in, the, hands, of, viewers, with...


Solution

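  • The return statement in your remove_stopwords function is indented one level too deep. It sits inside the for loop, so the function returns text at the end of the first iteration, after checking only the first word.

    Moving the return out of the loop fixes that, but removing items from a list while iterating over it makes the loop skip the word that follows each removed one (here, 'the' right after 'in'). Building a new list avoids both problems.

    Please go with:

    from nltk.corpus import stopwords

    # Build the set once; set membership tests are fast
    stop_words = set(stopwords.words('english'))

    def remove_stopwords(text):
        # Keep only the tokens that are not stopwords
        return [word for word in text if word not in stop_words]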
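
To see why removing from a list during iteration can still leave 'the' behind, here is a minimal sketch on the tokens from the question. STOP_WORDS is a small hypothetical stand-in for stopwords.words('english') so the snippet runs without downloading anything, and the function names remove_in_place and remove_with_new_list are made up for illustration.

```python
# Hypothetical stand-in for stopwords.words('english'), so no download is needed
STOP_WORDS = {"in", "the", "of", "with"}

def remove_in_place(tokens):
    # return is outside the loop, but the list is mutated while being iterated
    for word in tokens:
        if word in STOP_WORDS:
            tokens.remove(word)
    return tokens

def remove_with_new_list(tokens):
    # Builds a new list, so no element is skipped and the input is untouched
    return [word for word in tokens if word not in STOP_WORDS]

sample = ["tv", "future", "in", "the", "hands", "of", "viewers", "with"]

# Pass a copy so that sample itself is not modified
in_place_result = remove_in_place(list(sample))
new_list_result = remove_with_new_list(sample)

print(in_place_result)  # ['tv', 'future', 'the', 'hands', 'viewers']
print(new_list_result)  # ['tv', 'future', 'hands', 'viewers']
```

In the in-place version, removing 'in' shifts 'the' into the slot the loop has already passed, so 'the' is never checked. The list comprehension never mutates what it iterates over, which is why it catches every stopword.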