I am new to both Python and NLTK. I have to extract noun phrases from a corpus and then remove the stop words using NLTK. I have already written my code, but it still throws an error. Can anyone help me fix this problem? Or please also recommend a better solution if there is one. Thank you
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
docid='19509'
title='Example noun-phrase and stop words'
print('Document id:'),docid
print('Title:'),title
# list noun phrases
content='This is a sample sentence, showing off the stop words filtration.'
is_noun = lambda pos: pos[:2] == 'NN'
tokenized = nltk.word_tokenize(content)
nouns = [word for (word,pos) in nltk.pos_tag(tokenized) if is_noun(pos)]
print('All Noun Phrase:'),nouns
#remove stop words
stop_words = set(stopwords.words("english"))
example_words = word_tokenize(nouns)
filtered_sentence = []
for w in example_words:
    if w not in stop_words:
        filtered_sentence.append(w)
print('Without stop words:'),filtered_sentence
And I get the following error:
Traceback (most recent call last):
  File "C:\Users\User\Desktop\NLP\stop_word.py", line 20, in <module>
    example_words = word_tokenize(nouns)
  File "C:\Python27\lib\site-packages\nltk\tokenize\__init__.py", line 109, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "C:\Python27\lib\site-packages\nltk\tokenize\__init__.py", line 94, in sent_tokenize
    return tokenizer.tokenize(text)
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1237, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1285, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1276, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1316, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1289, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer
You are getting this error because the function word_tokenize expects a string as its argument, and you are giving it a list of strings.
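For illustration, if you ever did need to tokenize the nouns again, you would first have to join the list back into a single string yourself. This is just a sketch of that idea, not something your script actually needs here:

# hypothetical: turn the list of nouns back into one string before tokenizing
example_words = word_tokenize(' '.join(nouns))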
As far as I understand what you are trying to achieve, you do not need to tokenize at this point. Up to the line print('All Noun Phrase:'),nouns, you already have all the nouns of your sentence. To remove the stopwords, you can use:
### remove stop words ###
stop_words = set(stopwords.words("english"))
# find the nouns that are not in the stopwords
nouns_without_stopwords = [noun for noun in nouns if noun not in stop_words]
# your sentence is now clear
print('Without stop words:'),nouns_without_stopwords
Of course, in this case you get the same result as nouns, because none of the nouns is a stopword.
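If you prefer, you can also drop the stopwords in the same pass that picks out the nouns. A minimal sketch built on the variables already defined in your script (tokenized, is_noun, stop_words):

# one-pass version: keep tagged words that are nouns and not stopwords
# (stopwords.words("english") is all lowercase, hence word.lower())
nouns_without_stopwords = [word for (word, pos) in nltk.pos_tag(tokenized)
                           if is_noun(pos) and word.lower() not in stop_words]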
I hope this helps.