How to pass in an estimator to NLTK's NgramModel?

I am using NLTK to train a bigram model with a Laplace estimator. The constructor for NgramModel is:

def __init__(self, n, train, pad_left=True, pad_right=False,
             estimator=None, *estimator_args, **estimator_kwargs):

After some research, I found that a syntax that works is the following:

bigram_model = NgramModel(2, my_corpus, True, False, lambda f, b: LaplaceProbDist(f))

Although it seems to work correctly, I am confused about the last argument. Mainly, why is the 'estimator' argument a lambda function, and how does it interact with LaplaceProbDist?

Solution

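The estimator is a callable that takes a FreqDist and returns a probability distribution built from it; NgramModel calls it on the frequency distribution of each context. A lambda is simply a convenient way to write that callable. How many arguments it receives seems to depend on the NLTK version (the lambda in the question takes two, the one here takes one). Currently, you can do this, e.g.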

from nltk.model import NgramModel
from nltk.corpus import brown
from nltk.probability import LaplaceProbDist

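# The estimator turns a FreqDist of counts into a smoothed probability distribution.
est = lambda fdist: LaplaceProbDist(fdist)

# Train a trigram model on the first 100 tokens of the Brown news category.
corpus = brown.words(categories='news')[:100]
lm = NgramModel(3, corpus, estimator=est)

print lm
print (corpus[8], corpus[9], corpus[12])
# P(corpus[12] | corpus[8], corpus[9])
print (lm.prob(corpus[12], [corpus[8], corpus[9]]))
print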

[out]:

<NgramModel with 100 3-grams>
(u'investigation', u'of', u'primary')
0.0186667723526
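
To make the role of the lambda more concrete, here is a rough pure-Python sketch of the pattern, not NLTK's actual code. ToyLaplaceDist and train_bigrams are hypothetical stand-ins invented for illustration; the assumption is that the model counts words per context and then calls the estimator once per context to turn those counts into probabilities.

```python
from collections import Counter, defaultdict


class ToyLaplaceDist(object):
    """Hypothetical stand-in for an add-one (Laplace) smoothed distribution."""

    def __init__(self, counts, bins=None):
        self.counts = counts
        self.total = sum(counts.values())
        # bins = number of possible outcomes; default to the outcomes seen
        self.bins = len(counts) if bins is None else bins

    def prob(self, word):
        # Add-one smoothing: unseen words still get non-zero probability
        return (self.counts[word] + 1.0) / (self.total + self.bins)


def train_bigrams(words, estimator):
    """Hypothetical trainer: count per context, then call estimator(counts, bins)."""
    context_counts = defaultdict(Counter)
    for prev, word in zip(words, words[1:]):
        context_counts[prev][word] += 1
    bins = len(set(words))
    return {ctx: estimator(counts, bins) for ctx, counts in context_counts.items()}


words = "the cat sat on the mat".split()
# The lambda adapts the two-argument call made by the trainer to the
# constructor we want, ignoring the second argument as in the question.
model = train_bigrams(words, lambda counts, bins: ToyLaplaceDist(counts))
p_cat = model["the"].prob("cat")  # (1 + 1) / (2 + 2)
p_dog = model["the"].prob("dog")  # (0 + 1) / (2 + 2), unseen word
```

Seen this way, the lambda is only an adapter: the trainer decides which arguments to pass, and the lambda decides which of them reach the distribution's constructor.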

Note, however, that the model package within NLTK, which contains the LanguageModel object, is still under construction, so the code above might not work once a stable version is released.

To stay up to date on the model package, check these issues regularly:

#792
#800