Trying to find the probability of a phrase using bi-gram
# how many times bigram occurs
bg_count = bigrams.count(('word1', 'word2'))
# probability of the bigram in the text, P(word1 word2)
bg_count/number_of_bigrams
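The snippet above assumes that `bigrams` and `number_of_bigrams` already exist. One way they might be built is sketched below. The sample text and the lowercase whitespace tokenization are assumptions made for illustration, not taken from the original data.

```python
# Hypothetical sample text; the original presumably reads its text from a file.
text = "life might be short but life might be sweet"

# Assumption: simple lowercase whitespace tokenization.
s = text.lower().split()

# Pair each token with its successor to get the list of bigrams.
bigrams = list(zip(s, s[1:]))
number_of_bigrams = len(bigrams)

# How many times the bigram occurs, and its relative frequency.
bg_count = bigrams.count(("life", "might"))
print(bg_count, number_of_bigrams, bg_count / number_of_bigrams)
```

A text with n tokens has n - 1 bigrams, so the two denominators (number of unigrams and number of bigrams) differ by one.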
In a bigram language model:
P(w1, w2, w3, ..., wn) = P(w1) * P(w2|w1) * P(w3|w2) * ... * P(wn|wn-1)
So P(life, might) = P(life)*P(might|life)
where
P(life) = Count(life)/Number of unigrams
P(might|life) = Count(life, might)/Count(life)
Code to calculate P(life might) using the bigram model:
# assumes s is the list of tokens and bigrams is the list of adjacent token pairs
p_life = s.count("life") / len(s)
p_might_given_life = bigrams.count(("life", "might")) / s.count("life")
p_life_might = p_life * p_might_given_life
print(p_life_might)
Output:
0.0024752475247524753
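The same calculation can be extended to a phrase of any length by applying the chain of conditional probabilities word by word. The following is a minimal sketch, not a definitive implementation: `phrase_probability` is a hypothetical helper name, the token list is made-up sample data, and no smoothing is applied, so any unseen bigram makes the whole probability 0.

```python
from collections import Counter

def phrase_probability(tokens, phrase):
    """Unsmoothed bigram-model probability of `phrase` (a list of words).

    Assumes `tokens` is a non-empty list of words from the training text.
    """
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    # P(w1) = Count(w1) / number of unigrams
    prob = unigram_counts[phrase[0]] / len(tokens)

    # Multiply by P(w_i | w_{i-1}) = Count(w_{i-1}, w_i) / Count(w_{i-1})
    for prev, cur in zip(phrase, phrase[1:]):
        if unigram_counts[prev] == 0:
            return 0.0  # unseen context word: no estimate without smoothing
        prob *= bigram_counts[(prev, cur)] / unigram_counts[prev]
    return prob

# Hypothetical sample data for illustration.
tokens = "life might be short but life might be sweet".split()
print(phrase_probability(tokens, ["life", "might", "be"]))
```

Using `Counter` counts every unigram and bigram in one pass, which avoids rescanning the list with `list.count` for each word of the phrase.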
Log probabilities
Since probabilities are at most 1, multiplying many of them produces very small numbers that can underflow in floating-point arithmetic. We therefore usually work with log probabilities, which turn the multiplications into additions. Because the log is a monotonically increasing function, comparing log probabilities gives the same ordering as comparing the probabilities themselves.
log(p_life_might) = log(p_life * p_might_given_life)
= log(p_life) + log(p_might_given_life)
Code:
import math
print(math.log(p_life) + math.log(p_might_given_life))
Output:
-6.0014148779611505
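To see why the log form matters for longer phrases, here is a small sketch comparing the direct product with the sum of logs. The list `probs` is made-up data standing in for the per-word conditional probabilities of a long phrase; it is not taken from the text above.

```python
import math

# Hypothetical conditional probabilities for a 200-word phrase.
probs = [0.001] * 200

# Direct product: the true value is 1e-600, too small for a double,
# so the running product underflows to 0.0.
product = 1.0
for p in probs:
    product *= p

# Sum of logs: stays a finite, comparable number.
log_sum = sum(math.log(p) for p in probs)

print(product)   # 0.0
print(log_sum)   # roughly -1381.55
```

Note that `math.log(0)` raises a `ValueError`, so an unseen bigram with probability 0 has to be handled separately (for example with smoothing) before taking logs.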