python nlp nltk n-gram linguistics

How to pass in an estimator to NLTK's NgramModel?


I am using NLTK to train a bigram model with a Laplace estimator. The constructor for NgramModel is:

def __init__(self, n, train, pad_left=True, pad_right=False,
             estimator=None, *estimator_args, **estimator_kwargs):

After some research, I found that a syntax that works is the following:

bigram_model = NgramModel(2, my_corpus, True, False, lambda f, b: LaplaceProbDist(f))

Although it seems to work correctly, I am confused about the last two arguments. Mainly, why is the 'estimator' argument a lambda function, and how does it interact with LaplaceProbDist?


Solution

  • Currently, you can pass a lambda function that takes a FreqDist and returns a probability distribution built from it, e.g.

    from nltk.model import NgramModel
    from nltk.corpus import brown
    from nltk.probability import LaplaceProbDist
    
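    # The estimator is a callable that turns a FreqDist into a ProbDist.
    est = lambda fdist: LaplaceProbDist(fdist)
    
    # Train a trigram model on the first 100 words of the Brown news category.
    corpus = brown.words(categories='news')[:100]
    lm = NgramModel(3, corpus, estimator=est)
    
    # Python 2 print syntax, matching the output below.
    print lm
    print (corpus[8], corpus[9], corpus[12])
    # Probability of corpus[12] given the context (corpus[8], corpus[9]).
    print (lm.prob(corpus[12], [corpus[8], corpus[9]]))
    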

    [out]:

    <NgramModel with 100 3-grams>
    (u'investigation', u'of', u'primary')
    0.0186667723526
    

    Note, however, that the model package within NLTK, which contains the LanguageModel object, is still under construction, so the code above might not work once a stable version is released.
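
As for why the estimator is passed as a function: the model has to build one smoothed distribution per context, so it needs a callable it can invoke on each context's counts rather than a single ready-made distribution. The sketch below illustrates that idea using only the standard library. It is not NLTK's implementation; `LaplaceDist` and `train_bigram` are hypothetical stand-ins, and passing a bin count as the second argument is an assumption modelled on the two-argument lambda in the question.

```python
from collections import Counter, defaultdict


class LaplaceDist(object):
    """Add-one smoothed distribution over one context's counts.

    Hypothetical stand-in for a Laplace probability distribution.
    """

    def __init__(self, counts, bins):
        self.counts = counts               # Counter of next-word counts
        self.total = sum(counts.values())  # N: observations in this context
        self.bins = bins                   # B: number of possible outcomes

    def prob(self, word):
        # Laplace estimate: (count + 1) / (N + B)
        return (self.counts[word] + 1.0) / (self.total + self.bins)


def train_bigram(words, estimator):
    """Count bigrams, then call `estimator` once per context word."""
    counts = defaultdict(Counter)
    for prev, cur in zip(words, words[1:]):
        counts[prev][cur] += 1
    bins = len(set(words))  # assumption: use the vocabulary size as bins
    # The estimator is a factory: (counts, bins) -> distribution object.
    return dict((ctx, estimator(c, bins)) for ctx, c in counts.items())


words = "the cat sat on the mat the cat ran".split()
model = train_bigram(words, lambda counts, bins: LaplaceDist(counts, bins))

p_seen = model["the"].prob("cat")    # "the cat" occurs twice
p_unseen = model["the"].prob("ran")  # "the ran" never occurs
```

Here the lambda only forwards its arguments, so the class itself could be passed as the estimator. A lambda becomes useful when the arguments the model supplies do not match the distribution's constructor, for example when one of them should be ignored.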

    To stay up to date on the model package, check these issues regularly: