Fixing probabilities that do not sum to 1 in a matrix of words

I created a matrix using the answers to these questions: question 1 and question 2. Similar questions about this error did not help me resolve it.

However, the probabilities exceed 1, and I get this error: ValueError: probabilities do not sum to 1

Please let me know how I can share a piece of the df for reproducibility.

I generated the co-occurrence matrix using this code:

# Create matrix
my_df = pd.DataFrame(0, columns = words, index = words)
for k, v in frequency_list.items():
    # k is a (row_word, column_word) pair, v is its count
    my_df.at[k[0], k[1]] = v

which gives me a 10000 × 10000 matrix.

Then I converted the counts into relative frequencies by dividing each row by its sum:

row_sums = my_df.values.sum(axis = 1)
row_sums[row_sums == 0] = 1
my_prob = my_df/row_sums.reshape((-1,1)) 
my_prob

When I print the sums for the last 30 words

my_prob.sum().tail(30)

one of them has a probability above 1.

“thy               0.000000
“till              0.002538
**“to              1.109681**
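One possible reading of this output, offered as a sketch rather than a confirmed diagnosis: the division above normalizes each row, while `DataFrame.sum()` with no arguments adds up each column (`axis=0`). Column totals of a row-normalized matrix are not constrained to 1. The vocabulary and counts below are made up for illustration, and `div(..., axis=0)` is used in place of the reshape as an equivalent way to divide row-wise.

```python
import pandas as pd

# Hypothetical 3-word vocabulary and counts, for illustration only.
words = ["a", "b", "c"]
counts = pd.DataFrame(
    [[0, 3, 1],
     [0, 2, 0],
     [0, 5, 5]],
    index=words,
    columns=words,
)

# Row-normalize: divide every row by its own total (0 totals replaced by 1).
row_sums = counts.sum(axis=1).replace(0, 1)
probs = counts.div(row_sums, axis=0)

row_totals = probs.sum(axis=1)  # sums across each row
col_totals = probs.sum()        # default axis=0: sums down each column

print(row_totals)
print(col_totals)
```

If the rows are meant to be the probability distributions, `my_prob.sum(axis=1)` is the check that should come out close to 1.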

I tried to normalize.

Pick the word "the" and generate a list:

word_the = my_string_prob['the'].tolist()

Rescale the probabilities so that they sum to 1:

sum_of_elements = sum(word_the)
a = 1/sum_of_elements
my_probs_scaled = [e*a for e in word_the]
my_probs_scaled
sum(my_probs_scaled)
### Output 1.000000000000005
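A residual such as 1.000000000000005 is typical of binary floating-point rounding rather than a sign that the scaling is wrong. As a minimal sketch, assuming made-up counts in place of the real column, the sum can be checked with a tolerance via `math.isclose`, and `math.fsum` can be used for a more accurate total:

```python
import math

# Hypothetical counts standing in for the real 'the' column.
counts = [9, 4, 5, 4, 3, 7, 1]
total = sum(counts)
probs = [c / total for c in counts]

naive_sum = sum(probs)           # plain left-to-right float addition
accurate_sum = math.fsum(probs)  # tracks partial sums to reduce rounding error

print(naive_sum, accurate_sum)
# Compare with a tolerance instead of testing for exact equality with 1.0
print(math.isclose(naive_sum, 1.0, rel_tol=1e-9))
```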

This code worked on the smaller, simpler matrix from one of the questions above. Thanks!

Solution

You can control the precision of the arithmetic using the decimal module in Python. Consider the following example:

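from decimal import Decimal as D
from decimal import getcontext
getcontext().prec = 8  # results of Decimal arithmetic keep 8 significant digits

word_the = [9, 4, 5, 4]  # example counts
sum_of_elements = sum(word_the)
# The division itself is done in float; the result is then converted to Decimal
a = D(1/sum_of_elements)
# Each product is rounded to the context precision
my_probs_scaled = [D(e)*a for e in word_the]
print(my_probs_scaled)
print(sum(my_probs_scaled))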

And the output is:

[Decimal('0.40909091'), Decimal('0.18181818'), Decimal('0.22727273'), Decimal('0.18181818')]
1.0000000

You can experiment with the context parameters, including the precision.
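As a rough sketch of what varying the precision looks like, the loop below repeats the same scaling at a few precisions using `decimal.localcontext`, so the global context is left unchanged. The counts are the same illustrative ones as above and `results` is a name introduced here; with other data, rounding may still leave the total off by one unit in the last digit.

```python
from decimal import Decimal, localcontext

counts = [9, 4, 5, 4]  # same illustrative counts as in the example above
results = {}

for prec in (4, 8, 16):
    # localcontext() gives a temporary copy of the current decimal context
    with localcontext() as ctx:
        ctx.prec = prec
        total = Decimal(sum(counts))
        # The division is done in Decimal, rounded to `prec` significant digits
        probs = [Decimal(c) / total for c in counts]
        results[prec] = sum(probs)
        print(prec, probs, results[prec])
```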