
bi-gram probability


Trying to find the probability of a phrase using a bi-gram model

filename.txt

# how many times the bigram occurs
bg_count = bigrams.count(('word1', 'word2'))

# probability of the bigram in the text, P(word1 word2)
bg_count / number_of_bigrams
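The snippet above assumes a `bigrams` list and a `number_of_bigrams` count already exist. A minimal sketch of one way to build them, using a toy string in place of the contents of filename.txt and simple whitespace tokenization:

```python
# Toy stand-in for the contents of filename.txt.
text = "life might be a dream life might be real"
s = text.lower().split()              # whitespace tokenization

# Adjacent word pairs form the bigram list.
bigrams = list(zip(s, s[1:]))
number_of_bigrams = len(bigrams)

bg_count = bigrams.count(('life', 'might'))
print(bg_count / number_of_bigrams)   # fraction of all bigrams that are (life, might)
```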


Solution

  • In a bigram language model:

    P(w1,w2,w3,...,wn) = P(w1)*P(w2|w1)*P(w3|w2)*...*P(wn|wn-1)

    So P(life, might) = P(life)*P(might|life) where

    • P(life) = Count(life)/Number of unigrams
    • P(might|life) = Count(life, might)/Count(life)

    Code to calculate P(life might) using bigram model:

    p_life = s.count("life")/len(s)                                        # P(life)
    p_might_given_life = bigrams.count(('life', 'might'))/s.count('life')  # P(might|life)
    p_life_might = p_life * p_might_given_life
    print(p_life_might)
    

    Output:

    0.0024752475247524753
    

    Log probabilities

    Since each probability is <= 1, multiplying many small numbers together risks floating-point underflow, so we usually work with log probabilities instead: taking the log converts multiplication into addition. And since log is a monotonically increasing function, comparing log probabilities gives the same ordering as comparing the actual probabilities.

    log(p_life_might) = log(p_life * p_might_given_life)

    = log(p_life) + log(p_might_given_life)

    Code:

    import math
    print(math.log(p_life) + math.log(p_might_given_life))
    

    Output:

    -6.0014148779611505
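    Putting the pieces together, here is a self-contained sketch of the whole calculation on a toy corpus (the token list and the resulting numbers are illustrative, not the data from the answer above):

```python
import math

# Toy corpus; in practice s would come from tokenizing the input file.
s = "life might be a dream life might be real".split()
bigrams = list(zip(s, s[1:]))

p_life = s.count("life") / len(s)                               # P(life)
p_might_given_life = bigrams.count(("life", "might")) / s.count("life")  # P(might|life)
p_life_might = p_life * p_might_given_life                      # P(life might)

# Same quantity in log space: the product becomes a sum.
log_p = math.log(p_life) + math.log(p_might_given_life)

print(p_life_might)
print(log_p)
assert math.isclose(math.log(p_life_might), log_p)
```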