Tags: python, nlp, nltk, smoothing

Kneser-Ney smoothing of trigrams using Python NLTK


I'm trying to smooth a set of n-gram probabilities with Kneser-Ney smoothing using the Python NLTK. Unfortunately, the documentation for this is rather sparse.

What I'm trying to do is this: I parse a text into a list of trigram tuples. From this list I create a FreqDist, and then use that FreqDist to calculate a KN-smoothed distribution.

I'm pretty sure, though, that the result is totally wrong. When I sum up the individual probabilities I get something way beyond 1. Take this code example:

import nltk

# NB: nltk.trigrams() iterates over its argument, so passing a raw string
# produces *character* trigrams; tokenize first (e.g. with str.split() or
# nltk.word_tokenize()) if word trigrams are intended.
ngrams = nltk.trigrams("What a piece of work is man! how noble in reason! how infinite in faculty! in \
form and moving how express and admirable! in action how like an angel! in apprehension how like a god! \
the beauty of the world, the paragon of animals!")

freq_dist = nltk.FreqDist(ngrams)
kneser_ney = nltk.KneserNeyProbDist(freq_dist)

# For a true probability distribution, this sum over all samples would be 1.0.
prob_sum = 0
for i in kneser_ney.samples():
    prob_sum += kneser_ney.prob(i)
print(prob_sum)

The output is "41.51696428571428". The larger the corpus, the larger this sum grows, so whatever prob() returns cannot be a probability distribution.

Looking at the NLTK source, I would say the implementation is questionable, but maybe I just don't understand how it is supposed to be used. If so, could you give me a hint? Otherwise, do you know of a working Python implementation? I would rather not implement it myself.


Solution

  • Kneser-Ney (also have a look at Chen and Goodman for a great survey of the different smoothing techniques) is a rather complicated smoothing method, and only a few packages that I am aware of get it right. I am not aware of any working Python implementation, but you can definitely try SRILM if you just need the probabilities. (A sketch of the standard interpolated formula follows this list.)

    • There is a good chance that your sample contains words that did not occur in the training data (Out-Of-Vocabulary, or OOV, words), which, if not handled properly, can mess up the probabilities you get. Perhaps that is what causes the outrageously large and invalid probabilities?
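
For reference, here is a minimal sketch of what interpolated Kneser-Ney looks like for bigrams, written directly from the standard formula (as in Chen and Goodman) rather than from any particular library's API. The function name kneser_ney_bigram, the default discount of 0.75, and the toy sentence below are illustrative assumptions, not NLTK code. The point is the invariant the question's output violates: for any context seen in training, the probabilities sum to 1.

from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, discount=0.75):
    """Interpolated Kneser-Ney for bigrams (illustrative sketch, not NLTK).

    P(w | u) = max(c(u, w) - d, 0) / c(u) + lambda(u) * P_cont(w)
    """
    bigrams = list(zip(tokens, tokens[1:]))
    bigram_counts = Counter(bigrams)
    context_counts = Counter(u for u, _ in bigrams)
    bigram_types = set(bigrams)

    # Continuation count: in how many distinct contexts does each word appear?
    continuation = Counter(w for _, w in bigram_types)

    # Distinct words observed after each context, for the back-off weight.
    followers = defaultdict(set)
    for u, w in bigrams:
        followers[u].add(w)

    def prob(word, context):
        p_cont = continuation[word] / len(bigram_types)
        c_u = context_counts[context]
        if c_u == 0:
            # Unseen context (e.g. an OOV word): back off entirely to the
            # continuation distribution instead of producing garbage.
            return p_cont
        lam = discount * len(followers[context]) / c_u
        discounted = max(bigram_counts[(context, word)] - discount, 0) / c_u
        return discounted + lam * p_cont

    return prob


# The discounted mass plus the redistributed mass sums to 1 for any context
# seen in training, unlike the sum in the question.
tokens = "the cat sat on the mat the cat ran".split()
prob = kneser_ney_bigram(tokens)
print(sum(prob(w, "the") for w in set(tokens)))  # ~ 1.0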