Error when extracting noun phrases from the training corpus and removing stop words using NLTK

I am new to both Python and NLTK. I have to extract noun phrases from a corpus and then remove the stop words using NLTK. I have already written my code, but I still get an error. Can anyone help me fix this problem? Please also let me know if there is a better solution. Thank you.

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

title='Example noun-phrase and stop words'
print('Document id:'),docid

#list noun phrase
content='This is a sample sentence, showing off the stop words filtration.'
is_noun = lambda pos: pos[:2] == 'NN'
tokenized = nltk.word_tokenize(content)
nouns = [word for (word,pos) in nltk.pos_tag(tokenized) if is_noun(pos)]
print('All Noun Phrase:'),nouns

#remove stop words
stop_words = set(stopwords.words("english"))

example_words = word_tokenize(nouns)
filtered_sentence = []

for w in example_words:
  if w not in stop_words:
    filtered_sentence.append(w)

print('Without stop words:'),filtered_sentence

And I got the following error:

Traceback (most recent call last):
 File "C:\Users\User\Desktop\NLP\", line 20, in <module>
  example_words = word_tokenize(nouns)
 File "C:\Python27\lib\site-packages\nltk\tokenize\", line 109,in 
  return [token for sent in sent_tokenize(text, language)
 File "C:\Python27\lib\site-packages\nltk\tokenize\", line 94, in 
  return tokenizer.tokenize(text)
 File "C:\Python27\lib\site-packages\nltk\tokenize\", line 1237, in 
  return list(self.sentences_from_text(text, realign_boundaries))
 File "C:\Python27\lib\site-packages\nltk\tokenize\", line 1285, in 
  return [text[s:e] for s, e in self.span_tokenize(text,realign_boundaries)]
 File "C:\Python27\lib\site-packages\nltk\tokenize\", line 1276, in 
  return [(sl.start, sl.stop) for sl in slices]
 File "C:\Python27\lib\site-packages\nltk\tokenize\", line 1316, in 
  for sl1, sl2 in _pair_iter(slices):
 File "C:\Python27\lib\site-packages\nltk\tokenize\", line 310, in 
  prev = next(it)
 File "C:\Python27\lib\site-packages\nltk\tokenize\", line 1289, in 
  for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer


  • You are getting this error because word_tokenize expects a string as its argument, but you are passing it a list of strings (nouns). As far as I understand what you are trying to achieve, you do not need to tokenize at this point: by the time you reach print('All Noun Phrase:'),nouns, you already have all the nouns of your sentence. To remove the stop words, you can use:

    ### remove stop words ###
    stop_words = set(stopwords.words("english"))
    # find the nouns that are not in the stopwords
    nouns_without_stopwords = [noun for noun in nouns if noun not in stop_words]
    # the list now contains only nouns that are not stop words
    print('Without stop words:',nouns_without_stopwords)

    Of course, in this case the result is the same as nouns, because none of the nouns is a stop word.

    I hope this helps.
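
For reference, here is a minimal sketch of the same idea wrapped in a small function. The function name nouns_without_stop_words, the hand-written tagged list, and the tiny stop word set are all illustrative assumptions, used so the snippet runs without downloading any NLTK data. With NLTK you would pass nltk.pos_tag(nltk.word_tokenize(content)) and set(stopwords.words("english")) instead. The word is lowercased before the comparison on the assumption that the stop word list is lowercase, which is the case for NLTK's English list.

```python
def nouns_without_stop_words(tagged_tokens, stop_words):
    """Return the nouns from tagged_tokens that are not stop words.

    tagged_tokens: list of (word, tag) pairs, the shape nltk.pos_tag returns.
    stop_words: set of lowercase words to drop.
    """
    return [
        word
        for word, tag in tagged_tokens
        # Noun tags start with 'NN' (NN, NNS, NNP, NNPS).
        if tag.startswith('NN') and word.lower() not in stop_words
    ]


# Hand-written stand-in for the output of nltk.pos_tag (assumed tags, not real tagger output).
sample_tagged = [
    ('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('sample', 'NN'),
    ('sentence', 'NN'), ('showing', 'VBG'), ('off', 'RP'), ('the', 'DT'),
    ('stop', 'NN'), ('words', 'NNS'), ('filtration', 'NN'),
]

# Illustrative stop word set; 'sample' is included only to show a noun being removed.
sample_stop_words = {'this', 'is', 'a', 'the', 'off', 'sample'}

result = nouns_without_stop_words(sample_tagged, sample_stop_words)
print(result)
```

The point of the design is that the function works on a list of (word, tag) pairs directly, so there is no second call to word_tokenize and therefore no TypeError.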