I have a wordlist that consists of many subjects. The subjects were automatically extracted from sentences. I would like to keep only the nouns from the subjects. As you can see, some of the subjects contain adjectives, which I want to remove.
from nltk.corpus import wordnet as wn

wordlist = ['country', 'all', 'middle', 'various drinks', 'few people', 'its reputation', 'German Embassy', 'many elections']
returnlist = []
for word in wordlist:
    # keep the entry if any of its WordNet synsets is a noun
    for syn in wn.synsets(word):
        if syn.pos() == 'n':
            returnlist.append(word)
            break
print(returnlist)
The result of the above is:
['country','it', 'middle']
However, I want the result to look like this:
wordlist=['country','it', 'middle','drinks','people','reputation','German Embassy','elections']
How can I do that?
First, your list comes from text that was not tokenized well, so I tokenized it again. Then I tagged every word with its part of speech and kept the nouns, i.e. the words whose POS tag is one of the NN tags:
>>> import nltk
>>> text = ' '.join(wordlist).lower()
>>> tokens = nltk.word_tokenize(text)
>>> tags = nltk.pos_tag(tokens)
>>> nouns = [word for word, pos in tags if pos in ('NN', 'NNP', 'NNS', 'NNPS')]
>>> nouns
['country', 'drinks', 'people', 'Embassy', 'elections']
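If you want multi-word subjects such as 'German Embassy' to stay together, one option is to tag each subject separately instead of joining the whole list first. This is only a rough sketch, not tested against your data: `keep_nouns` and `nltk_tagger` are names I made up for illustration, the phrases are split on whitespace rather than with `nltk.word_tokenize`, and the result depends entirely on what the tagger returns for such short phrases.

```python
def keep_nouns(phrases, tag_tokens):
    """Return one noun-only string per phrase; phrases without a noun are dropped.

    tag_tokens is any callable that maps a list of tokens to a list of
    (token, tag) pairs using Penn Treebank tags, e.g. nltk.pos_tag.
    """
    result = []
    for phrase in phrases:
        tagged = tag_tokens(phrase.split())
        # NN, NNS, NNP and NNPS all start with "NN"
        nouns = [word for word, tag in tagged if tag.startswith("NN")]
        if nouns:
            result.append(" ".join(nouns))
    return result


def nltk_tagger(tokens):
    # Hypothetical thin wrapper around nltk.pos_tag; nltk is imported lazily
    # so keep_nouns itself can be used with any other tagger.
    import nltk
    return nltk.pos_tag(tokens)
```

You would call it as `keep_nouns(wordlist, nltk_tagger)`. Because each subject is tagged on its own, the nouns of one subject end up in one string, so a subject made of two proper nouns is returned as a single entry rather than two.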