Tags: python, nltk, information-theory

Calculating PMI for bigrams and a discrepancy


Suppose I have the following text:

text = "this is a foo bar bar black sheep  foo bar bar black sheep foo bar bar black sheep shep bar bar black sentence"

I can calculate the PMI for bigrams using NLTK as follows:

import nltk
from nltk import word_tokenize
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(word_tokenize(text))
for i in finder.score_ngrams(bigram_measures.pmi):
    print(i)

which gives:

(('is', 'a'), 4.523561956057013)
(('this', 'is'), 4.523561956057013)
(('a', 'foo'), 2.938599455335857)
(('sheep', 'shep'), 2.938599455335857)
(('black', 'sentence'), 2.523561956057013)
(('black', 'sheep'), 2.523561956057013)
(('sheep', 'foo'), 2.353636954614701)
(('bar', 'black'), 1.523561956057013)
(('foo', 'bar'), 1.523561956057013)
(('shep', 'bar'), 1.523561956057013)
(('bar', 'bar'), 0.5235619560570131)

Now, to check my own understanding, I want to compute PMI('black', 'sheep') by hand. The PMI formula is given as:

$$ \mathrm{pmi}(w_1, w_2) = \log \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)} $$

There are 4 instances of 'black' in the text, 3 instances of 'sheep', and 'black' and 'sheep' occur together 3 times; the length of the text is 23 tokens. Following the formula, I do:

import numpy as np

np.log((3/23)/((4/23)*(3/23)))

That gives 1.749199854809259 rather than 2.523561956057013. Why is there a discrepancy here? What am I missing?
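For reference, here is how I tallied those counts (a quick sanity-check sketch, using the same text and tokenizer as above):

from collections import Counter

from nltk import bigrams, word_tokenize

tokens = word_tokenize(text)
unigram_counts = Counter(tokens)            # frequency of each word
bigram_counts = Counter(bigrams(tokens))    # frequency of each adjacent pair

print(len(tokens))                          # 23
print(unigram_counts['black'])              # 4
print(unigram_counts['sheep'])              # 3
print(bigram_counts[('black', 'sheep')])    # 3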


Solution

  • NLTK computes PMI with a base-2 logarithm, whereas your calculation uses the natural logarithm (base e).

    Per NumPy's documentation, numpy.log is the natural logarithm (base e), which is not what you want here.

    The following expression gives the expected result of 2.523561956057013:

    import math

    math.log((3/23)/((4/23)*(3/23)), 2)
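    As a further check, the same value can be reproduced with np.log2, or (a sketch, assuming I am reading NLTK's API right: the scorer takes raw counts as its marginals) with BigramAssocMeasures.pmi directly:

    import numpy as np
    from nltk.collocations import BigramAssocMeasures

    # base-2 logarithm of the probability ratio
    print(np.log2((3/23)/((4/23)*(3/23))))         # 2.523561956057013

    # NLTK's scorer: joint count, (count of w1, count of w2), total token count
    print(BigramAssocMeasures.pmi(3, (4, 3), 23))  # 2.523561956057013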