Tags: python, dictionary, nlp, nltk, corpus

Create Dictionary from Penn Treebank Corpus sample from NLTK?


I know that the Treebank corpus is already tagged, but unlike the Brown corpus, I can't figure out how to get a dictionary of tags. For instance,

>>> import nltk
>>> from nltk.corpus import brown
>>> wordcounts = nltk.ConditionalFreqDist(brown.tagged_words())

This doesn't seem to work on the Treebank corpus. How can I build the same kind of word-to-tag dictionary there?


Solution

  • Quick solution:

    >>> from nltk.corpus import treebank
    >>> from nltk import ConditionalFreqDist as cfd
    >>> from itertools import chain
    >>> treebank_tagged_words = list(chain(*list(chain(*[[tree.pos() for tree in treebank.parsed_sents(pf)] for pf in treebank.fileids()]))))
    >>> wordcounts = cfd(treebank_tagged_words)
    >>> treebank_tagged_words[0]
    (u'Pierre', u'NNP')
    >>> wordcounts[u'Pierre']
    FreqDist({u'NNP': 1})
    >>> treebank_tagged_words[100]
    (u'asbestos', u'NN')
    >>> wordcounts[u'asbestos']
    FreqDist({u'NN': 11})
    

    For more details, see https://en.wikipedia.org/wiki/User:Alvations/NLTK_cheatsheet/CorporaReaders#Penn_Tree_Bank

    See also: Is there a way of avoiding so many list(chain(*list_of_list))?
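
    One such rewrite (my own sketch, not taken from the linked question) replaces the nested list(chain(*...)) calls with a single comprehension over files, trees, and (word, tag) pairs:

    >>> treebank_tagged_words = [pair
    ...                          for pf in treebank.fileids()
    ...                          for tree in treebank.parsed_sents(pf)
    ...                          for pair in tree.pos()]
    >>> treebank_tagged_words[0]
    (u'Pierre', u'NNP')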


    Note that the Penn Treebank sample in NLTK contains only 3,000+ sentences, while the Brown corpus has 57,340 sentences.
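
    You can verify the sizes yourself; the exact figures below assume the standard NLTK data packages (the Treebank sample ships 3,914 parsed sentences):

    >>> from nltk.corpus import treebank, brown
    >>> len(treebank.parsed_sents())
    3914
    >>> len(brown.sents())
    57340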

    To split the sentences into training and test sets:

    from nltk.corpus import treebank
    from nltk import ConditionalFreqDist as cfd
    from itertools import chain

    # One tagged sentence per tree: each sentence is a list of (word, tag) pairs.
    treebank_tagged_sents = list(chain(*[[tree.pos() for tree in treebank.parsed_sents(pf)] for pf in treebank.fileids()]))

    # 90/10 split at the sentence level.
    total_len = len(treebank_tagged_sents)
    train_len = int(90 * total_len / 100)

    train_set = treebank_tagged_sents[:train_len]
    print(len(train_set))
    train_treebank_tagged_words = cfd(chain(*train_set))

    test_set = treebank_tagged_sents[train_len:]
    print(len(test_set))
    test_treebank_tagged_words = cfd(chain(*test_set))
    
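    As a quick sanity check (a sketch, not part of the original answer), the training CFD doubles as a most-frequent-tag lookup: FreqDist.max() returns the most common tag recorded for a word, and unseen words need some fallback:

    # Most-frequent-tag baseline built from the training counts (sketch).
    def most_frequent_tag(word):
        if word in train_treebank_tagged_words:
            return train_treebank_tagged_words[word].max()
        return 'NN'  # naive fallback for words unseen in training

    print(most_frequent_tag('Pierre'))  # 'NNP', since the Vinken sentence is in the 90% split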

    If you're going to use the Brown corpus (which does not contain parsed sentences), you can use tagged_sents():

    >>> from nltk.corpus import brown
    >>> brown_tagged_sents = brown.tagged_sents()
    >>> len(brown_tagged_sents)
    57340
    >>> brown_tagged_sents[0]
    [(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR'), (u'an', u'AT'), (u'investigation', u'NN'), (u'of', u'IN'), (u"Atlanta's", u'NP$'), (u'recent', u'JJ'), (u'primary', u'NN'), (u'election', u'NN'), (u'produced', u'VBD'), (u'``', u'``'), (u'no', u'AT'), (u'evidence', u'NN'), (u"''", u"''"), (u'that', u'CS'), (u'any', u'DTI'), (u'irregularities', u'NNS'), (u'took', u'VBD'), (u'place', u'NN'), (u'.', u'.')]
    >>> total_len = len(brown_tagged_sents)
    >>> train_len = int(90 * total_len/100)
    >>> train_set = brown_tagged_sents[:train_len]
    >>> train_brown_tagged_words = cfd(chain(*train_set))
    >>> train_brown_tagged_words['asbestos']
    FreqDist({u'NN': 1})
    
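    The same split works with a real tagger: NLTK's UnigramTagger performs this most-frequent-tag lookup internally. A sketch (note that newer NLTK versions rename evaluate() to accuracy()):

    from nltk import UnigramTagger

    # Train on the first 90% of Brown, score token accuracy on the held-out 10%.
    tagger = UnigramTagger(train_set)
    print(tagger.evaluate(brown_tagged_sents[train_len:]))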

    As @alexis noted, unless you need to split the corpus at the sentence level, you can simply use tagged_words(), which also exists in the Penn Treebank API in NLTK:

    >>> from nltk.corpus import treebank
    >>> from nltk.corpus import brown
    
    >>> treebank.tagged_words()
    [(u'Pierre', u'NNP'), (u'Vinken', u'NNP'), ...]
    >>> brown.tagged_words()
    [(u'The', u'AT'), (u'Fulton', u'NP-TL'), ...]
    
    >>> type(treebank.tagged_words())
    <class 'nltk.corpus.reader.util.ConcatenatedCorpusView'>
    >>> type(brown.tagged_words())
    <class 'nltk.corpus.reader.util.ConcatenatedCorpusView'>
    
    >>> from nltk import ConditionalFreqDist as cfd
    >>> cfd(brown.tagged_words())
    <ConditionalFreqDist with 56057 conditions>
    >>> cfd(treebank.tagged_words())
    <ConditionalFreqDist with 12408 conditions>
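
    Keep in mind that the two corpora use different tagsets, so the two dictionaries are not directly comparable; for example, on the standard NLTK data the article/determiner 'the' comes out as:

    >>> brown_cfd = cfd(brown.tagged_words())
    >>> treebank_cfd = cfd(treebank.tagged_words())
    >>> brown_cfd['the'].max()
    u'AT'
    >>> treebank_cfd['the'].max()
    u'DT'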