I built a matrix using the answers to these questions: question 1 and question 2. Similar questions about this error did not help me resolve it.
However, some probabilities exceed 1, and I get ValueError: probabilities do not sum to 1
Please let me know how I can share a piece of the df for reproducibility.
I generated the co-occurrence matrix using this code:
import pandas as pd

# Create the matrix and fill cell (k[0], k[1]) with its value from frequency_list
my_df = pd.DataFrame(0, columns=words, index=words)
for k, v in frequency_list.items():
    my_df.at[k[0], k[1]] = v
which gives me a 10000 x 10000 matrix.
Then I converted the counts into probabilities by dividing each row by its sum:
row_sums = my_df.values.sum(axis = 1)
row_sums[row_sums == 0] = 1
my_prob = my_df/row_sums.reshape((-1,1))
my_prob
When I print the sums for the last 30 words
my_prob.sum().tail(30)
one of them is above 1:
“thy 0.000000
“till 0.002538
“to 1.109681
I tried to normalize the values.
Pick the word "the" and generate a list:
word_the = my_string_prob['the'].tolist()
Try to normalize the probabilities:
sum_of_elements = sum(word_the)
a = 1/sum_of_elements
my_probs_scaled = [e*a for e in word_the]
my_probs_scaled
sum(my_probs_scaled)
### Output 1.000000000000005
This code worked on the smaller, simpler matrix from one of the questions above. Thanks!
You can control the precision of your numbers using Python's decimal module. Consider the following example:
from decimal import Decimal as D
from decimal import getcontext
getcontext().prec = 8
word_the = [9, 4, 5, 4]
sum_of_elements = sum(word_the)
a = D(1/sum_of_elements)
my_probs_scaled = [D(e)*a for e in word_the]
print(my_probs_scaled)
print(sum(my_probs_scaled))
And the output is:
[Decimal('0.40909091'), Decimal('0.18181818'), Decimal('0.22727273'), Decimal('0.18181818')]
1.0000000
You can play around with the parameters, including the precision.
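Separately, it may be worth checking which axis is being summed. The question divides each row by its row sum, but DataFrame.sum() with no arguments adds down each column, so those totals are not expected to equal 1 and can exceed it. The following is only a minimal sketch under assumptions: the three-word vocabulary and the pair counts are made up for illustration, and the names counts, probs, row_totals and col_totals are hypothetical, not taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real vocabulary and pair counts.
words = ["the", "to", "thy"]
frequency_list = {("the", "to"): 3, ("to", "to"): 2, ("to", "the"): 1, ("thy", "to"): 4}

counts = pd.DataFrame(0.0, columns=words, index=words)
for (first, second), n in frequency_list.items():
    counts.at[first, second] = n

# Row-normalize, as in the question: each row is divided by its own sum.
row_sums = counts.sum(axis=1)
row_sums[row_sums == 0] = 1  # all-zero rows stay all zero instead of dividing by 0
probs = counts.div(row_sums, axis=0)

row_totals = probs.sum(axis=1)  # across each row: 1 for every non-empty row
col_totals = probs.sum(axis=0)  # down each column: what probs.sum() returns by default

print(row_totals)
print(col_totals)

# Sampling should therefore use a row (probs.loc[word]), not a column (probs[word]).
rng = np.random.default_rng(0)
next_word = rng.choice(words, p=probs.loc["to"].to_numpy())
print(next_word)
```

If the real matrix behaves like this toy one, passing a row selected with .loc as p should avoid the error, as long as that row is not all zeros. A column only works if it is renormalized first, as the question does by hand.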