python nlp artificial-intelligence probability n-gram

bi-gram probability

I am trying to find the probability of a phrase using a bigram model.

filename.txt

# how many times bigram occurs
bg_count = bigrams.count(('word1', 'word2'))

# probability of bigram in text P(word1 word2)
bg_count/number_of_bigrams
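
For context, the snippet above assumes a `bigrams` list already exists. A minimal sketch of one way to build it, assuming the text is tokenized by splitting on whitespace (the toy sentence and the name `tokens` are illustrative and do not come from the original file):

```python
# Toy text standing in for the contents of the input file (assumption).
text = "life is what you make it and life might surprise you"
tokens = text.split()

# Pair each token with its successor to get the list of bigrams.
bigrams = list(zip(tokens, tokens[1:]))
number_of_bigrams = len(bigrams)

# How many times the bigram occurs, and its relative frequency among all bigrams.
bg_count = bigrams.count(("life", "might"))
p_bigram = bg_count / number_of_bigrams
print(bg_count, number_of_bigrams, p_bigram)
```

Note that this relative frequency is the share of all bigrams that are ("life", "might"), which is not the same quantity as the conditional probability P(might|life).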

Solution

In a bigram language model:

P(w1, w2, w3, ..., wn) = P(w1) * P(w2|w1) * P(w3|w2) * ... * P(wn|wn-1)

So P(life, might) = P(life)*P(might|life) where

P(life) = Count(life)/Number of unigrams
P(might|life) = Count(life, might)/Count(life)
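
The same count-based estimates extend to phrases of any length. The following is a rough sketch under stated assumptions, not a definitive implementation: the function name `bigram_phrase_probability`, the toy token list, and the choice to return 0.0 for unseen words are all illustrative.

```python
from collections import Counter

def bigram_phrase_probability(tokens, phrase):
    """Estimate P(phrase) from raw counts under a bigram model.

    tokens: list of words in the corpus; phrase: non-empty list of words.
    """
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    # P(w1) = Count(w1) / number of unigrams
    prob = unigram_counts[phrase[0]] / len(tokens)

    # Multiply by P(w_i | w_{i-1}) = Count(w_{i-1}, w_i) / Count(w_{i-1})
    for prev, word in zip(phrase, phrase[1:]):
        if unigram_counts[prev] == 0:
            return 0.0  # unseen context word: no estimate without smoothing
        prob *= bigram_counts[(prev, word)] / unigram_counts[prev]
    return prob

# Toy corpus (assumption), not the original input file.
tokens = "life is what you make it and life might surprise you".split()
print(bigram_phrase_probability(tokens, ["life", "might"]))
```

With this toy corpus, P(life) = 2/11 and P(might|life) = 1/2, so the phrase probability is 1/11. Any bigram that never occurs in the corpus drives the whole product to zero, which is why practical models usually add smoothing.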

Code to calculate P(life might) using bigram model:

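# Assumes s is the list of tokens in the text and bigrams is the list of adjacent token pairs.
p_life = s.count("life") / len(s)  # P(life)
p_might_given_life = bigrams.count(("life", "might")) / s.count("life")  # P(might|life)
p_life_might = p_life * p_might_given_life
print(p_life_might)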

Output:

0.0024752475247524753

Log probabilities

Probabilities are at most 1, and multiplying many small numbers risks floating-point underflow, so we usually work with log probabilities, which turn the product into a sum. Because the logarithm is monotonically increasing, comparing log probabilities gives the same ordering as comparing the probabilities themselves.

log(p_life_might) = log(p_life * p_might_given_life)
                  = log(p_life) + log(p_might_given_life)

Code:

import math
print(math.log(p_life) + math.log(p_might_given_life))

Output:

-6.0014148779611505
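
To illustrate why the log form matters for longer phrases, here is a small sketch. The per-step probabilities are made-up values chosen only to trigger underflow; they are not taken from the text above.

```python
import math

# Hypothetical conditional probabilities for a 100-word phrase (assumption).
step_probs = [1e-5] * 100

# Direct multiplication: the true value (1e-500) is too small for a float.
product = 1.0
for p in step_probs:
    product *= p

# Summing logs keeps the result in a comfortable numeric range.
log_prob = sum(math.log(p) for p in step_probs)

print(product)   # underflows to 0.0
print(log_prob)  # a finite negative number
```

One caveat: `math.log(0)` raises a `ValueError`, so a bigram with zero count has to be handled separately (for example with smoothing) before taking logs.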