I'm trying to remove the stop words from "I don't like ice cream." I have defined:
stop_words = set(nltk.corpus.stopwords.words('english'))
and the function
def stop_word_remover(text):
    return [word for word in text if word.lower() not in stop_words]
But when I apply the function to the string in question, I get this list:
[' ', 'n', '’', ' ', 'l', 'k', 'e', ' ', 'c', 'e', ' ', 'c', 'r', 'e', '.']
and when I join the strings together with ''.join(stop_word_remover("I don’t like ice cream.")), I get
' n’ lke ce cre.'
which is not what I was expecting.
Any tips on where I have gone wrong?
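The comprehension word for word in text iterates over the characters of text (not over words!), because text is a string. Each single character is then checked against stop_words, and single letters such as 'i', 'a', 'o' and 't' are in NLTK's English list, which is why those letters disappear from your output.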
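A quick way to see the difference, using only plain Python. Note that str.split() is used here purely for illustration; it is not a proper tokenizer and leaves punctuation attached to words:

```python
text = "I don't like ice cream."

# Iterating over a str yields one character at a time.
chars = [item for item in text]
print(chars[:5])   # ['I', ' ', 'd', 'o', 'n']

# Splitting first yields whitespace-separated words instead.
words = [item for item in text.split()]
print(words)       # ['I', "don't", 'like', 'ice', 'cream.']
```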
You should tokenize the text first, so change your code as below:
import nltk
from nltk.tokenize import word_tokenize

# One-time downloads of the stop-word list and the tokenizer model
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt').
nltk.download('stopwords')
nltk.download('punkt')

stop_words = set(nltk.corpus.stopwords.words('english'))

def stop_word_remover(text):
    # Split the text into word tokens before filtering
    word_tokens = word_tokenize(text)
    word_list = [word for word in word_tokens if word.lower() not in stop_words]
    return " ".join(word_list)

stop_word_remover("I don't like ice cream.")
## "n't like ice cream ."
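The leftover n't appears because word_tokenize splits "don't" into "do" and "n't", and only "do" is in the stop-word list. If you would rather treat the contraction as one word, one possible approach is a small regex tokenizer that keeps apostrophes inside words. The sketch below is only an illustration: DEMO_STOP_WORDS, TOKEN_PATTERN and remove_stop_words_keep_contractions are made-up names, and the tiny stop-word set stands in for NLTK's list (which does contain "don't") so the example runs without any downloads.

```python
import re

# Hypothetical miniature stop-word set, for illustration only;
# in practice pass set(nltk.corpus.stopwords.words('english')).
DEMO_STOP_WORDS = {"i", "don't", "do", "not", "a", "the"}

# A word with optional internal apostrophes, or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

def remove_stop_words_keep_contractions(text, stop_words=DEMO_STOP_WORDS):
    # Normalize the typographic apostrophe so "don’t" matches "don't".
    normalized = text.replace("\u2019", "'")
    tokens = TOKEN_PATTERN.findall(normalized)
    kept = [tok for tok in tokens if tok.lower() not in stop_words]
    # Join with spaces, then re-attach punctuation to the preceding word.
    return re.sub(r"\s+([^\w\s])", r"\1", " ".join(kept))

print(remove_stop_words_keep_contractions("I don’t like ice cream."))
# like ice cream.
```

This trades NLTK's more careful tokenization for simplicity, so it may behave differently on quotes, hyphenated words or abbreviations.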