I am using NLTK to train a bigram model using a Laplace estimator. The constructor for the NgramModel is:
def __init__(self, n, train, pad_left=True, pad_right=False,
             estimator=None, *estimator_args, **estimator_kwargs):
After some research, I found that a syntax that works is the following:
bigram_model = NgramModel(2, my_corpus, True, False, lambda f, b:LaplaceProbDist(f))
Although it seems to work correctly, I am confused about the last two arguments. Mainly, why is the 'estimator' argument a lambda function, and how does it interact with LaplaceProbDist?
Currently, you can pass a lambda function as the estimator. It takes the FreqDist that the model has counted and returns a LaplaceProbDist built from it, e.g.
from nltk.model import NgramModel
from nltk.corpus import brown
from nltk.probability import LaplaceProbDist
est = lambda fdist: LaplaceProbDist(fdist)
corpus = brown.words(categories='news')[:100]
lm = NgramModel(3, corpus, estimator=est)
print lm
print (corpus[8], corpus[9], corpus[12] )
print (lm.prob(corpus[12], [corpus[8], corpus[9]]) )
print
[out]:
<NgramModel with 100 3-grams>
(u'investigation', u'of', u'primary')
0.0186667723526
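To see why the estimator is a callable rather than a ready-made distribution, here is a minimal sketch in plain Python. It is not NLTK's code: ToyLaplaceDist and build_conditional_dists are hypothetical stand-ins, and it assumes the model calls the estimator once per context with that context's counts plus any extra estimator arguments (here, a bin count). Under that assumption, the lambda in the question takes two parameters because it receives both, and it simply drops the second one.

```python
from collections import Counter, defaultdict


class ToyLaplaceDist(object):
    """Hypothetical stand-in for LaplaceProbDist: add-one smoothing."""

    def __init__(self, counts, bins=None):
        self.counts = counts
        self.total = sum(counts.values())
        # bins = number of possible outcomes; default to the outcomes seen.
        self.bins = bins if bins is not None else len(counts)

    def prob(self, word):
        # Laplace estimate: (count + 1) / (total + bins)
        return (self.counts[word] + 1.0) / (self.total + self.bins)


def build_conditional_dists(tokens, estimator, *estimator_args):
    """Hypothetical helper: count bigrams per context, then call the
    estimator once per context to turn raw counts into a distribution."""
    counts_by_context = defaultdict(Counter)
    for prev, word in zip(tokens, tokens[1:]):
        counts_by_context[prev][word] += 1
    return dict(
        (context, estimator(counts, *estimator_args))
        for context, counts in counts_by_context.items()
    )


tokens = "the cat sat on the mat".split()
vocab_size = len(set(tokens))  # 5 distinct words

# Same shape as the lambda in the question: accept (f, b), ignore b.
dists_ignore = build_conditional_dists(
    tokens, lambda f, b: ToyLaplaceDist(f), vocab_size)

# Alternative: forward the bin count to the distribution.
dists_bins = build_conditional_dists(
    tokens, lambda f, b: ToyLaplaceDist(f, b), vocab_size)

p_ignore = dists_ignore["the"].prob("cat")  # (1 + 1) / (2 + 2)
p_bins = dists_bins["the"].prob("cat")      # (1 + 1) / (2 + 5)
```

The model cannot be handed a finished distribution up front, because the counts only exist after it has read the training data, and there is one set of counts per context. Passing a function lets the model decide when, and how many times, to build a distribution.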
But do note that the model package within NLTK, which contains the LanguageModel object, is under construction, so the NgramModel example above might not work once a stable version is released.
To keep updated on the issues related to the model package, check these issues regularly: