python-3.x nlp nltk corpus pos-tagger

How can I get verbs, nouns, adjectives from brown corpus?


I have been trying to get all the nouns, verbs, etc. separately from the Brown corpus, so I tried the code

brown.all_synsets('n')

but apparently this code works with WordNet only. I am using Python 3.4, by the way.


EDITED

@alvas's answer worked, but when I use the result with random I get an error. Have a look.

nn = {word for word, pos in brown.tagged_words() if pos.startswith('NN')}
print(nn)

the output is

{'such', 'rather', 'Quite', 'Such', 'quite'}

but when I use

random.choice(nn)

I get

Traceback (most recent call last):
  File "/home/aziz/Desktop/2222.py", line 5, in <module>
    print(random.choice(NN))
  File "/usr/lib/python3.4/random.py", line 256, in choice
    return seq[i]
TypeError: 'set' object does not support indexing

Solution

  • TL;DR

    >>> from nltk.corpus import brown
    >>> {word for word, pos in brown.tagged_words() if pos.startswith('NN')}
    

    In Long

    Call brown.tagged_words(), which returns a list-like sequence of ('word', 'POS') tuples that you can iterate over:

    >>> from nltk.corpus import brown
    >>> brown.tagged_words()
    [(u'The', u'AT'), (u'Fulton', u'NP-TL'), ...]
    

    Please read this chapter to learn how the NLTK corpus API works: http://www.nltk.org/book/ch02.html

    Then use a set comprehension over it to collect the unique words whose tag starts with NN. In the Brown tagset that covers the common-noun tags, e.g. NN, NNS, NN$, etc. (proper nouns are tagged NP, as in the 'Fulton' example above, so they are not matched).

    >>> {word for word, pos in brown.tagged_words() if pos.startswith('NN')}
    

    Note that the output might not be what you expect, because a word that is POS-tagged as a syntactic noun is not necessarily a semantic argument/entity.
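
    Since the question also asks about verbs and adjectives, the same prefix test can be applied to several categories in one pass. The following is only a sketch: words_by_category is a hypothetical helper (not part of NLTK), the sample list is hand-written in the style of Brown tags, and the prefixes NN, VB and JJ are an assumption about which tags you want to count as nouns, verbs and adjectives.

```python
from collections import defaultdict


def words_by_category(tagged_words, prefixes):
    """Group unique words by the first tag prefix that matches.

    tagged_words: iterable of (word, pos) tuples, e.g. brown.tagged_words()
    prefixes: dict mapping a category name to a tag prefix
    """
    groups = defaultdict(set)
    for word, pos in tagged_words:
        for name, prefix in prefixes.items():
            if pos.startswith(prefix):
                groups[name].add(word)
                break  # a word/tag pair goes into at most one category
    return groups


# Hand-written sample for illustration only; with NLTK and the Brown
# corpus installed, pass brown.tagged_words() instead.
sample = [
    ("The", "AT"), ("jury", "NN"), ("said", "VBD"),
    ("recent", "JJ"), ("elections", "NNS"), ("took", "VBD"),
    ("place", "NN"),
]

# Assumed prefixes: NN* for common nouns, VB* for verbs, JJ* for adjectives.
PREFIXES = {"nouns": "NN", "verbs": "VB", "adjectives": "JJ"}

groups = words_by_category(sample, PREFIXES)
for name in PREFIXES:
    print(name, sorted(groups[name]))
```

    Keep in mind that the Brown tagset gives forms of be, have and do their own tags (BE*, HV*, DO*), so a plain VB prefix will not pick those up; add more prefixes if you need them.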


    Also, I don't think that the words you've extracted are correct. Double checking the list:

    >>> nouns = {word for word, pos in brown.tagged_words() if pos.startswith('NN')} 
    >>> 'rather' in nouns
    False
    >>> 'such' in nouns
    False
    >>> 'Quite' in nouns
    False
    >>> 'quite' in nouns
    False
    >>> 'Such' in nouns
    False
    

    The output of the set comprehension: http://pastebin.com/bJaPdpUk


    Why does random.choice(nn) fail when nn is a set?

    The input to random.choice() is a sequence (see https://docs.python.org/2/library/random.html#random.choice).

    random.choice(seq)

    Return a random element from the non-empty sequence seq. If seq is empty, raises IndexError.

    The built-in sequence types in Python 3 include list, tuple, range and str.

    Since a set isn't a sequence and doesn't support indexing, you get the TypeError shown in the question (IndexError is only raised for an empty sequence).
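
    One way around this is to convert the set to a sequence before sampling. A minimal sketch, assuming nn is a set of strings like the one built above (a small hand-written set stands in for it here so the snippet runs without NLTK):

```python
import random

# Stand-in for the set built from brown.tagged_words(); illustrative values only.
nn = {"jury", "investigation", "election"}

# random.choice() needs something indexable, so turn the set into a list.
# sorted() also gives a stable order, which helps if you seed the generator.
noun_list = sorted(nn)

picked = random.choice(noun_list)
print(picked)

# random.sample() draws several distinct items from the same list.
several = random.sample(noun_list, 2)
print(several)
```

    If you sample repeatedly, build the list once and reuse it rather than converting the set on every call.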