How to get PMI scores for trigrams with NLTK Collocations?


I know how to get bigram and trigram collocations using NLTK, and I apply them to my own corpora. The code is below.

My only problem is how to print out each bigram or trigram together with its PMI value. I have searched the NLTK documentation multiple times; either I'm missing something or it's not there.

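import nltk
from nltk.collocations import TrigramCollocationFinder

# Read the corpus and split it into whitespace-delimited tokens
with open("large.txt", "r") as f:
    myList = f.read().split()
myCorpus = nltk.Text(myList)

trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(myCorpus)

# Keep only trigrams that occur at least 3 times
finder.apply_freq_filter(3)
# nbest() returns the trigrams only, without their PMI scores
print(finder.nbest(trigram_measures.pmi, 500000))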

Solution

  • If you take a look at the source code for nltk.collocations.TrigramCollocationFinder (see http://www.nltk.org/_modules/nltk/collocations.html), you'll find that nbest() is a thin wrapper around score_ngrams() that throws the scores away:

    def nbest(self, score_fn, n):
        """Returns the top n ngrams when scored by the given function."""
        return [p for p,s in self.score_ngrams(score_fn)[:n]]
    

    So you can call score_ngrams() directly instead of nbest(), since it already returns a sorted list of (ngram, score) pairs:

    def score_ngrams(self, score_fn):
        """Returns a sequence of (ngram, score) pairs ordered from highest to
        lowest score, as determined by the scoring function provided.
        """
        return sorted(self._score_ngrams(score_fn),
                      key=_itemgetter(1), reverse=True)
    

    Try:

    import nltk
    from nltk.collocations import *
    from nltk.tokenize import word_tokenize
    
    text = "this is a foo bar bar black sheep  foo bar bar black sheep foo bar bar black sheep shep bar bar black sentence"
    
    trigram_measures = nltk.collocations.TrigramAssocMeasures()
    finder = TrigramCollocationFinder.from_words(word_tokenize(text))
    
    # Each item is an (ngram, score) pair, highest PMI first
    for i in finder.score_ngrams(trigram_measures.pmi):
        print(i)
    

    [out]:

    (('this', 'is', 'a'), 9.047123912114026)
    (('is', 'a', 'foo'), 7.46216141139287)
    (('black', 'sheep', 'shep'), 5.46216141139287)
    (('black', 'sheep', 'foo'), 4.877198910671714)
    (('a', 'foo', 'bar'), 4.462161411392869)
    (('sheep', 'shep', 'bar'), 4.462161411392869)
    (('bar', 'black', 'sheep'), 4.047123912114026)
    (('bar', 'black', 'sentence'), 4.047123912114026)
    (('sheep', 'foo', 'bar'), 3.877198910671714)
    (('bar', 'bar', 'black'), 3.047123912114026)
    (('foo', 'bar', 'bar'), 3.047123912114026)
    (('shep', 'bar', 'bar'), 3.047123912114026)
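
The scores in the output above can be checked by hand. For a trigram, the value that trigram_measures.pmi reports works out to log2(c(w1 w2 w3) * N^2 / (c(w1) * c(w2) * c(w3))), where c() is a raw count and N is the total number of tokens. The following is a minimal sketch that recomputes this in plain Python on the same example text. The function name trigram_pmi is made up for illustration and is not part of NLTK, and text.split() is assumed to be an adequate stand-in for word_tokenize here because the text has no punctuation.

```python
import math
from collections import Counter


def trigram_pmi(tokens):
    """Return {trigram: PMI} for every trigram of adjacent tokens.

    Hypothetical helper for illustration only; not an NLTK function.
    """
    n = len(tokens)                      # N: total number of tokens
    word_counts = Counter(tokens)        # c(w) for each single word
    trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))

    scores = {}
    for (w1, w2, w3), count in trigram_counts.items():
        unigram_product = word_counts[w1] * word_counts[w2] * word_counts[w3]
        # log2(c(w1 w2 w3) * N^2) - log2(c(w1) * c(w2) * c(w3))
        scores[(w1, w2, w3)] = math.log2(count * n ** 2) - math.log2(unigram_product)
    return scores


text = ("this is a foo bar bar black sheep  foo bar bar black sheep "
        "foo bar bar black sheep shep bar bar black sentence")
scores = trigram_pmi(text.split())

# Print (trigram, score) pairs from highest to lowest PMI
for ngram, score in sorted(scores.items(), key=lambda item: (-item[1], item[0])):
    print((ngram, score))
```

For the script in the question, the same kind of (trigram, score) pairs should come from calling finder.score_ngrams(trigram_measures.pmi) after finder.apply_freq_filter(3), in place of finder.nbest(...).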