nlp · nltk · n-gram

How to interpret Python NLTK bigram likelihood ratios?


I'm trying to figure out how to properly interpret nltk's "likelihood ratio" given the below code (taken from this question).

import collections

import nltk.collocations
import nltk.corpus

bgm = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(
    nltk.corpus.brown.words())
scored = finder.score_ngrams(bgm.likelihood_ratio)

# Group bigrams by first word in bigram.
prefix_keys = collections.defaultdict(list)
for key, scores in scored:
    prefix_keys[key[0]].append((key[1], scores))

# Sort each group by score, highest first.
for key in prefix_keys:
    prefix_keys[key].sort(key=lambda x: -x[1])

prefix_keys['baseball']

With the following output:

[('game', 32.11075451975229),
 ('cap', 27.81891372457088),
 ('park', 23.509042621473505),
 ('games', 23.10503351305401),
 ("player's", 16.22787286342467),
 ('rightfully', 16.22787286342467),
[...]

From the docs, it looks like the likelihood ratio printed next to each bigram comes from:

"Scores ngrams using likelihood ratios as in Manning and Schutze 5.3.4."

Referring to this article, which states on pg. 22:

One advantage of likelihood ratios is that they have a clear intuitive interpretation. For example, the bigram powerful computers is e^(.5*82.96) = 1.3*10^18 times more likely under the hypothesis that computers is more likely to follow powerful than its base rate of occurrence would suggest. This number is easier to interpret than the scores of the t test or the χ² test which we have to look up in a table.
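
To make the quoted arithmetic concrete, here is the conversion in Python, using both the book's score and the 32.11 score for 'baseball game' from my output above (assuming the natural exponential, as in the book):

import math

# e^(0.5 * score) turns a likelihood-ratio score into a "times more
# likely" factor, per the quoted passage.
print(math.exp(0.5 * 82.96))  # M&S's 'powerful computers', on the order of 10^18
print(math.exp(0.5 * 32.11))  # 'baseball game' above, roughly 9.4 million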

What I'm confused about is what the "base rate of occurrence" would be when I run the nltk code above on my own data. Would it be safe to say, for example, that 'game' is e^(0.5 * 32.11) ≈ 9.4 million times more likely to appear next to 'baseball' in the current dataset than its base rate in average standard English usage would suggest? Or is it that 'game' is more likely to appear next to 'baseball' than other words are to appear next to 'baseball' within this same dataset?

Any help/guidance towards a clearer interpretation or example is much appreciated!


Solution

nltk does not have a universal corpus of English language usage from which to model the probability of 'game' following 'baseball'.

Using the corpus it does have available, the likelihood ratio is calculated from the probability of 'game' given that the preceding word is 'baseball', compared against the base rate of 'game' in that same corpus. In other words, the "base rate of occurrence" is the rate within your own data, not within English at large.
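
As a minimal sketch (standard nltk API, nothing beyond the question's own imports), you can score that single bigram directly, which should reproduce the value from the question:

import nltk.collocations
import nltk.corpus

bgm = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(
    nltk.corpus.brown.words())

# Score one bigram without ranking the whole corpus; score_ngram returns
# None if the bigram was never observed.
print(finder.score_ngram(bgm.likelihood_ratio, 'baseball', 'game'))  # ~32.11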

nltk.corpus.brown is a built-in corpus, or set of observations, and the predictive power of any probability-based model is entirely defined by the observations used to construct or train it.
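
For example, the base rates in question are just unigram frequencies within Brown; a quick sketch:

import nltk
from nltk.corpus import brown

fd = nltk.FreqDist(brown.words())

# The "base rate of occurrence" of 'game' is its unconditional probability
# within this corpus, not in English at large.
print(fd['game'] / fd.N())
print(fd['baseball'] / fd.N())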

nltk.collocations.BigramAssocMeasures().raw_freq models raw frequency. Raw frequency, like the t test, is not well suited to sparse data such as bigrams, hence the provision of the likelihood ratio.
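
To see the difference, compare the top five bigrams under each measure (a sketch using the standard nbest helper):

import nltk.collocations
import nltk.corpus

bgm = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(
    nltk.corpus.brown.words())

# Raw frequency surfaces the most common pairs (mostly function words);
# the likelihood ratio surfaces pairs that co-occur far more often than
# their individual frequencies would predict.
print(finder.nbest(bgm.raw_freq, 5))
print(finder.nbest(bgm.likelihood_ratio, 5))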

The likelihood ratio as calculated by Manning and Schutze is not equivalent to frequency.

https://nlp.stanford.edu/fsnlp/promo/colloc.pdf

Section 5.3.4 describes the calculation in detail.
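
As a rough sketch of that calculation (my own transcription of the formulas in 5.3.4 using binomial log likelihoods and natural logs, not nltk's internal code), applied to the question's bigram:

import math

import nltk
from nltk.corpus import brown

def log_l(k, n, x):
    # Log binomial likelihood log L(k; n, x) = k*log(x) + (n - k)*log(1 - x);
    # the binomial coefficient cancels in the ratio, so it is omitted.
    return k * math.log(x) + (n - k) * math.log(1 - x)

words = brown.words()
N = len(words)
unigrams = nltk.FreqDist(words)
bigram_counts = nltk.FreqDist(nltk.bigrams(words))

w1, w2 = 'baseball', 'game'
c1, c2, c12 = unigrams[w1], unigrams[w2], bigram_counts[(w1, w2)]

p = c2 / N                  # H1: P(w2 | w1) = P(w2 | anything else) = p
p1 = c12 / c1               # H2: P(w2 | w1)
p2 = (c2 - c12) / (N - c1)  # H2: P(w2 | not w1)

log_lambda = (log_l(c12, c1, p) + log_l(c2 - c12, N - c1, p)
              - log_l(c12, c1, p1) - log_l(c2 - c12, N - c1, p2))
print(-2 * log_lambda)  # should land close to nltk's 32.11... for this pair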

The likelihood ratio can be arbitrarily large: unlike a probability, it is not bounded above.

This chart may be helpful:

[Figure: bigram likelihood-ratio scores from Manning and Schutze; the likelihood ratio score is the leftmost column of the table.]