Tags: python, nltk, corpus, tagged-corpus

What is the probability of ‘beginning’ given ‘the’?


Using an NLTK Conditional Frequency Distribution and the nltk.bigrams function, train a bigram model on the Genesis corpus:

text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)
Answer the following questions:

What is the probability of ‘beginning’ given ‘the’?
What is the probability of ‘the’?

Note: The probabilities you give as an answer MUST be probabilities computable from this corpus.

Hi, can someone help me? This exercise is from the NLTK book. When I tried it, I got 78%, which does not make sense. I'm trying to compute this in Python.


Solution

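  • There is a difference between the joint probability of the bigram 'the beginning', i.e. 'the' immediately followed by 'beginning':

    p('beginning','the')
    

    and the conditional probability of 'beginning' given that the previous word is 'the', which a bigram model estimates as count('the beginning') / count('the'):

    p('beginning'|'the') = p('beginning','the') / p('the')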
    

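    Try this (Python 3). nltk.bigrams yields each bigram as (first word, second word), so 'the' followed by 'beginning' is the key ('the', 'beginning'):

    from collections import Counter
    
    import nltk
    
    text = nltk.corpus.genesis.words('english-kjv.txt')
    bigram_counts = Counter(nltk.bigrams(text))
    unigram_counts = Counter(text)
    total_bigrams = sum(bigram_counts.values())
    
    # Joint probability of the bigram 'said unto' (about 0.003976 on this corpus).
    print("p('said','unto') =", bigram_counts['said', 'unto'] / total_bigrams)
    
    # Conditional probability of 'unto' given that the previous word is 'said'.
    print("p('unto'|'said') =", bigram_counts['said', 'unto'] / unigram_counts['said'])
    
    # Conditional probability of 'beginning' given that the previous word is 'the'.
    print("p('beginning'|'the') =", bigram_counts['the', 'beginning'] / unigram_counts['the'])
    
    # Unigram probability of 'the'.
    print("p('the') =", unigram_counts['the'] / len(text))
    

    Note that the key order matters: looking up ('beginning', 'the') counts 'beginning' followed by 'the', which never occurs in this corpus, so it gives 0.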
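
To see how the conditional and unigram probabilities come out of raw counts, here is a minimal pure-Python sketch on a toy token list. The helper names `bigram_conditional_prob` and `unigram_prob` and the toy sentence are made up for illustration; they are not NLTK functions, and the numbers are not from the Genesis corpus.

```python
from collections import Counter


def bigram_conditional_prob(tokens, prev, word):
    """Estimate P(word | prev) as count(prev, word) / count(prev as a bigram's first word)."""
    tokens = list(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    # Count each token only where it starts a bigram (every token except the last).
    context_counts = Counter(tokens[:-1])
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[prev, word] / context_counts[prev]


def unigram_prob(tokens, word):
    """Estimate P(word) as count(word) / total number of tokens."""
    tokens = list(tokens)
    return Counter(tokens)[word] / len(tokens) if tokens else 0.0


# Toy token list (hypothetical, not the Genesis corpus).
toy = "in the beginning god created the heaven and the earth".split()
p_cond = bigram_conditional_prob(toy, "the", "beginning")  # 1 of the 3 bigrams starting with 'the'
p_the = unigram_prob(toy, "the")  # 3 of 10 tokens
print(p_cond, p_the)
```

On the real corpus, the ConditionalFreqDist from the question should give the same kind of ratio via cfd['the'].freq('beginning'), since each cfd[word] is a FreqDist over the words that follow it. Both values are guaranteed to lie between 0 and 1, which satisfies the exercise's requirement that the answers be probabilities computable from the corpus.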