I'm trying to find a way to make a transition matrix using unigrams, bigrams, and trigrams for a given text using Python and numpy. Each row's probabilities should sum to one. I did this first with bigrams and it worked fine:
# Build a row-stochastic transition matrix from bigram counts:
# matrix[i, j] = P(word_j | word_i), so every row sums to 1.
distinct_words = list(word_dict.keys())
dwc = len(distinct_words)
# np.float was removed in NumPy 1.24 -- the builtin float is the correct dtype.
matrix = np.zeros((dwc, dwc), dtype=float)
for first_word_idx, word in enumerate(distinct_words):
    # One pass over the bigram counts collects every successor of `word`
    # (the original scanned `ngrams` twice per row).
    successor_counts = {}
    for bigram, count in ngrams.items():
        word_1, word_2 = bigram.split(" ")
        if word_1 == word:
            successor_counts[word_2] = successor_counts.get(word_2, 0) + count
    total = sum(successor_counts.values())
    # Dividing each successor count by the row total normalizes the row to 1.
    # If `word` never occurs as a first word, successor_counts is empty and
    # the row stays all zeros (no division by zero), same as the original.
    for word_2, count in successor_counts.items():
        matrix[first_word_idx, index_dict[word_2]] = count / total
But now I want to add unigrams and trigrams and weight their probabilities (trigrams * .6, bigrams * .2, unigrams * .2). I don't think my Python is very succinct, which is one problem, but I also don't know how to use multiple n-grams (and weights, although honestly the weights are secondary) so that all of the probabilities in any given row still add up to one.
# Interpolated transition matrix: each row mixes trigram, bigram, and unigram
# estimates with fixed weights, then is re-normalized so it sums to 1.
distinct_words = list(word_dict.keys())
dwc = len(distinct_words)
# np.float was removed in NumPy 1.24 -- the builtin float is the correct dtype.
matrix = np.zeros((dwc, dwc), dtype=float)

# Interpolation weights for the trigram / bigram / unigram terms.
# They sum to 1 (the original used .4 + .2 + .2 = 0.8, so rows could not
# reach 1 even when everything else was right).
tri_w, bi_w, uni_w = .6, .2, .2

# A unigram probability divides by the corpus size (total token count), not
# by the vocabulary size as the original did.
corpus_size = sum(word_dict.values())

for first_word_index, word in enumerate(distinct_words):
    # --- trigram term ----------------------------------------------------
    # NOTE(review): following the original code, the two-word context is
    # (word, distinct_words[i+1]) -- that is the next word in the
    # *vocabulary*, not in the text. Confirm this is really intended.
    if first_word_index < dwc - 1:
        # Compare with an explicit separator: "ab"+"c" and "a"+"bc"
        # concatenate to the same string, so the original
        # `word_1 + word_2 == word + next_word` test could match
        # unrelated contexts.
        context = word + " " + distinct_words[first_word_index + 1]
        tri_total = 0
        for trigram, count in trigrams.items():
            word_1, word_2, word_3 = trigram.split()
            if word_1 + " " + word_2 == context:
                tri_total += count
        if tri_total:
            for trigram, count in trigrams.items():
                word_1, word_2, word_3 = trigram.split()
                if word_1 + " " + word_2 == context:
                    # The transition out of the context goes to the *third*
                    # word (the original indexed by word_2), and the
                    # conditional probability divides by the context total
                    # tri_total (the original divided by the bigram count
                    # and never used tri_total).
                    matrix[first_word_index, index_dict[word_3]] += tri_w * count / tri_total

    # --- bigram term -----------------------------------------------------
    bi_total = 0
    for bigram, count in bigrams.items():
        word_1, word_2 = bigram.split(" ")
        if word_1 == word:
            bi_total += count
    if bi_total:
        for bigram, count in bigrams.items():
            word_1, word_2 = bigram.split(" ")
            if word_1 == word:
                # Accumulate (+=) instead of assigning, so the bigram term
                # does not clobber the trigram mass already in this cell.
                matrix[first_word_index, index_dict[word_2]] += bi_w * count / bi_total

    # --- unigram term ----------------------------------------------------
    # Spread the unigram mass over every observed word in the row (the
    # original computed uni_prob and then never used it).
    for other_word, other_count in word_dict.items():
        matrix[first_word_index, index_dict[other_word]] += uni_w * other_count / corpus_size

    # Re-normalize: when a context had no trigrams (or no bigrams) the
    # weighted terms sum to less than 1, so rescale the row to sum to 1.
    row_sum = matrix[first_word_index].sum()
    if row_sum > 0:
        matrix[first_word_index] /= row_sum
I'm reading off of this lecture for how to set up my probability matrix and it seems to make sense, but I'm not sure where I'm going wrong.
If it helps, my n_grams are coming from this- it just produces a dictionary of the n_gram as a string and its count.
def get_ngram(words, n):
    """Count the n-grams of *words*.

    Args:
        words: sequence of word tokens.
        n: n-gram order (1 = unigrams, 2 = bigrams, ...); must be >= 1.

    Returns:
        Dict mapping each space-joined n-gram string to its occurrence count.

    Raises:
        ValueError: if n < 1 (the original crashed with IndexError for n=0).
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    counts = {}
    # The first complete n-gram ends at index n-1; this is what the
    # original's obscure `if i > (n-2)` guard expressed.
    for i in range(n - 1, len(words)):
        # Space-join the window ending at i. strip() mirrors the original's
        # trailing-space cleanup in case tokens carry edge whitespace.
        gram = " ".join(words[i - n + 1:i + 1]).strip()
        counts[gram] = counts.get(gram, 0) + 1
    return counts
Let us try to do it in pure Python in the most efficient way, relying only on list and dictionary comprehensions.
Suppose we have a toy text consisting of 3 words "a", "b", and "c":
# Fix the RNG seed so the sampled text (shown below) is reproducible.
np.random.seed(42)
# Draw 100 tokens uniformly at random from {"a", "b", "c"} and join them
# into a single space-separated string.
text = " ".join([np.random.choice(list("abc")) for _ in range(100)])
# In a REPL this last expression echoes the generated text.
text
'c a c c a a c b c c c c a c b a b b b b a a b b a a a c c c b c b b c
b c c a c a c c a a c b a b b b a b a b c c a c c b a b b b b b b b a
c b b b b b b c c b c a b a a b c a b a a a a c a a a c a a'
Then to make unigrams, bigrams, and trigrams you can proceed as follows:
# Tokenize, then count unigrams, bigrams, and trigrams of the toy text.
unigrams = text.split()

def _tally(grams):
    # Occurrence count per item; plain dict + get keeps it dependency-free.
    counts = {}
    for gram in grams:
        counts[gram] = counts.get(gram, 0) + 1
    return counts

unigram_counts = _tally(unigrams)
# zip() stops at its shortest argument, so no explicit [:-1] slicing is
# needed on the first sequence.
# NOTE: grams are concatenated without a separator ("aa", "aaa"), which is
# unambiguous here only because every token is a single character.
bigrams = ["".join(pair) for pair in zip(unigrams, unigrams[1:])]
bigram_counts = _tally(bigrams)
trigrams = ["".join(triple) for triple in zip(unigrams, unigrams[1:], unigrams[2:])]
trigram_counts = _tally(trigrams)
To incorporate weights and normalize:
# Interpolation weights, aligned with dics: unigram, bigram, trigram.
weights = [.2, .2, .6]
dics = [unigram_counts, bigram_counts, trigram_counts]
# Keys from the three dicts never collide here (1-, 2-, and 3-char grams),
# so merging into one flat dict is safe.
weighted_counts = {k: v * w for d, w in zip(dics, weights) for k, v in d.items()}
# Hoist the normalizer out of the comprehension: the original recomputed
# sum(weighted_counts.values()) once per key, which is accidentally O(n^2).
total = sum(weighted_counts.values())
# desired output
freqs = {k: v / total for k, v in weighted_counts.items()}
What we've got:
pprint(freqs)
{'a': 0.06693711967545637,
'aa': 0.02434077079107505,
'aaa': 0.024340770791075043,
...
Finally, sanity check:
print(sum(freqs.values()))
0.999999999999999
This code may be further customized to incorporate your tokenization rules e.g., or make it shorter by looping through different grams at once.