python nlp nltk text-parsing pos-tagger

What do NN, VBD, IN, DT, NNS, and RB mean in NLTK?


When I chunk text, I get lots of codes in the output, such as NN, VBD, IN, DT, NNS, and RB. Is there a documented list somewhere that explains what these mean? I have tried googling "nltk chunk code", "nltk chunk grammar", and "nltk chunk tokens".

But I am not able to find any documentation which explains what these codes mean.


Solution

  • The tags you see do not come from chunking; they come from the POS tagging that happens before chunking. They belong to the Penn Treebank tagset; see https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

    >>> from nltk import word_tokenize, pos_tag, ne_chunk
    >>> sent = "This is a Foo Bar sentence."
    # POS tag.
    >>> pos_tag(word_tokenize(sent))
    [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('Foo', 'NNP'), ('Bar', 'NNP'), ('sentence', 'NN'), ('.', '.')]
    >>> tagged_sent = pos_tag(word_tokenize(sent))
    # Chunk.
    >>> ne_chunk(tagged_sent)
    Tree('S', [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), Tree('ORGANIZATION', [('Foo', 'NNP'), ('Bar', 'NNP')]), ('sentence', 'NN'), ('.', '.')])
    

    To get the chunks, look for subtrees within the chunked output. In the output above, Tree('ORGANIZATION', [('Foo', 'NNP'), ('Bar', 'NNP')]) is the chunk.
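Looking for subtrees can be sketched roughly as follows. This is a minimal sketch, not the only way to do it: it rebuilds the tree from the output above by hand so that it runs without the tagger or chunker models, and `extract_chunks` is a hypothetical helper name, not part of NLTK.

```python
from nltk.tree import Tree

# Hand-built copy of the ne_chunk output shown above (assumption: in real use
# this tree would come from ne_chunk(tagged_sent)).
chunked = Tree('S', [
    ('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'),
    Tree('ORGANIZATION', [('Foo', 'NNP'), ('Bar', 'NNP')]),
    ('sentence', 'NN'), ('.', '.'),
])


def extract_chunks(tree):
    """Return (label, text) pairs for every subtree below the root."""
    chunks = []
    # subtrees() also yields the root itself, so filter it out.
    for subtree in tree.subtrees(filter=lambda t: t is not tree):
        words = [word for word, tag in subtree.leaves()]
        chunks.append((subtree.label(), " ".join(words)))
    return chunks


chunks = extract_chunks(chunked)
print(chunks)  # [('ORGANIZATION', 'Foo Bar')]
```

The leaves keep their POS tags, so the same loop could collect `(word, tag)` pairs instead of joined text if the tags are needed downstream.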

    This tutorial is pretty helpful for explaining the chunking process in NLTK: http://www.eecis.udel.edu/~trnka/CISC889-11S/lectures/dongqing-chunking.pdf

    For official documentation, see http://www.nltk.org/howto/chunk.html
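
For the six tags in the question specifically, the Penn Treebank meanings can be kept in a small lookup table. The sketch below is only an illustration: `PENN_TAGS` and `describe` are hypothetical names, and the table covers just those six tags, not the full tagset. NLTK also provides `nltk.help.upenn_tagset()`, which prints tag descriptions once the tagset data has been downloaded.

```python
# Meanings of the six tags from the question, per the Penn Treebank tagset.
# Only a partial table; the full tagset has several dozen tags.
PENN_TAGS = {
    "NN": "noun, singular or mass",
    "VBD": "verb, past tense",
    "IN": "preposition or subordinating conjunction",
    "DT": "determiner",
    "NNS": "noun, plural",
    "RB": "adverb",
}


def describe(tagged_tokens):
    """Attach a description to each (word, tag) pair; unknown tags get '?'."""
    return [(word, tag, PENN_TAGS.get(tag, "?")) for word, tag in tagged_tokens]


described = describe([("a", "DT"), ("sentence", "NN"), ("is", "VBZ")])
for row in described:
    print(row)
```

Tags outside the table, such as VBZ here, come back as '?', which makes it easy to spot what still needs looking up in the full list.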