I suspect this is something very fundamental I don't know or understand about this code; my only excuse is that I am a complete beginner in python.
I am trying some of the cosine similarity matrix calculations from this post:
What's the fastest way in Python to calculate cosine similarity given sparse matrix data?
One of them requires the calculation of the reciprocal of the diagonal of the initial matrix product.
Say that the initial matrix is m, each row of which represents an 'object', whose 'coordinates' are in the columns of the matrix. So you want to calculate cosine similarities between rows.
Then, to use the matrix product method, you do something like mp = numpy.dot(m, m.T).
Now, if there are no rows with only 0's in m, the diagonal of mp can never have any zero values, as each of its elements is the sum of the squared elements of the corresponding row of m.
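To make that claim concrete, here is a minimal check on a small dense matrix. The data is made up for illustration and is not the matrix from my calculations; it only shows that the diagonal of the product equals the row-wise sums of squares.

```python
import numpy as np

# Toy dense matrix (hypothetical data): 3 objects (rows), 4 coordinates (columns).
m = np.array([[1, 0, 2, 0],
              [0, 3, 0, 1],
              [2, 2, 0, 0]], dtype=np.int64)

mp = np.dot(m, m.T)                   # pairwise dot products between rows
diag = mp.diagonal()                  # squared norm of each row
row_sq_sums = (m ** 2).sum(axis=1)    # the same quantity, computed directly

print(diag)         # squared norms taken from the product
print(row_sq_sums)  # squared norms computed row by row
```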
The m I am using in my calculations has indeed no rows with all 0's.
And indeed, when I do:
mp = np.dot(m, m.T)
mnorms2 = mp.diagonal()
I can easily test that:
mnorms2.min()
# 32
As I am using a sparse matrix (csr) for m, mp is also sparse, and I need only specific pairs of elements of mnorms2, which I obtain by:
mp_rows, mp_cols = mp.nonzero()
These are the indices of the elements of mnorms2 that I need to multiply together, take the square root of, and divide mp.data by.
I saw that the code in the method I was trying went through all the intermediate steps, but I thought it was only for illustration, so I tried to do it in one go instead, like:
mp.data = mp.data / numpy.sqrt(mnorms2[mp_rows] * mnorms2[mp_cols])
And this gave a division by zero error, although I know for sure that no element of mnorms2 is zero!
Worse, it did not do it systematically, but only for some m's, although in all cases these matrices had similar sparse structure and content.
In fact I even did:
denom = numpy.sqrt(mnorms2[mp_rows] * mnorms2[mp_cols])
and I found that:
denom.min()
# 0.0
How can the (element by element) product of two arrays that have no 0's have any 0's?
The only thing that worked in the end was:
inv = 1 / numpy.sqrt(mnorms2[mp_rows])
inv = inv / numpy.sqrt(mnorms2[mp_cols])
mp.data = mp.data * inv
I really don't understand why going step by step works, whereas the 'all in one go' method causes an error, as the operations should be the same in the end.
And there is clearly something strange going on, because when I try this:
mnorms2[0:5]
# array([71, 73, 77, 68, 72], dtype=uint8)
mnorms2[0:5] * mnorms2[0:5]
# array([177, 209, 41, 16, 64], dtype=uint8)
177 is not the square of 71... :/
What is going on here?
Any suggestions / ideas?
Thanks!
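I think the problem is the dtype: uint8 is an unsigned 8-bit integer (0 to 255), so any product above 255 wraps around modulo 256. For example, 71 * 71 = 5041, and 5041 mod 256 = 177.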
import numpy as np
mnorms2 = np.array([71, 73, 77, 68, 72], dtype='uint8')
mnorms2 * mnorms2
# array([177, 209, 41, 16, 64], dtype=uint8)
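The same overflow would also explain the zeros in denom: a uint8 product is kept modulo 256, so any product that is an exact multiple of 256 comes out as 0 even though neither factor is 0. A minimal sketch, using values chosen for illustration (32 is the minimum reported in the question; 64 is just another example):

```python
import numpy as np

a = np.array([71, 32, 64], dtype=np.uint8)

wrapped = a * a                    # result stays uint8, so it wraps modulo 256
exact = a.astype(np.int64) ** 2    # true squares: 5041, 1024, 4096

print(wrapped)       # uint8 products after wrap-around
print(exact % 256)   # the true squares reduced modulo 256
```

Since 32 * 32 = 1024 = 4 * 256, a squared norm of 32 paired with itself gives a denominator of 0, which fits the fact that only some matrices triggered the error.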
But if you change the dtype to np.float64:
mnorms2 = np.array([71, 73, 77, 68, 72], dtype=np.float64)
mnorms2 * mnorms2
# array([5041., 5329., 5929., 4624., 5184.])
To change the dtype, do:
mnorms2 = mnorms2.astype(np.float64)
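With that cast in place, the one-shot formula from the question should give the same result as the step-by-step version. The sketch below uses small plain arrays as hypothetical stand-ins for mnorms2, mp_rows, mp_cols and mp.data (no sparse matrix involved), so treat it as an illustration rather than a drop-in fix. It also suggests why the step-by-step version worked: numpy.sqrt returns floating-point values, so the square roots are taken before any integer multiplication happens.

```python
import numpy as np

# Hypothetical stand-ins for the variables in the question.
mnorms2 = np.array([32, 64, 71, 128], dtype=np.uint8)  # squared row norms
mp_rows = np.array([0, 0, 1, 2, 3])                    # row index of each stored entry
mp_cols = np.array([0, 1, 3, 2, 3])                    # column index of each stored entry
data = np.array([32.0, 10.0, 5.0, 71.0, 128.0])        # plays the role of mp.data

# One-shot formula on uint8: the products wrap before the square root is taken.
bad_denom = np.sqrt(mnorms2[mp_rows] * mnorms2[mp_cols])

# Same formula after casting to float64: no wrap-around.
norms = mnorms2.astype(np.float64)
good_denom = np.sqrt(norms[mp_rows] * norms[mp_cols])
cosine = data / good_denom

# Step-by-step version from the question, applied to the float64 norms.
inv = 1 / np.sqrt(norms[mp_rows])
inv = inv / np.sqrt(norms[mp_cols])
stepwise = data * inv

# sqrt of an integer array is already floating point ('f' kind).
sqrt_kind = np.sqrt(mnorms2).dtype.kind

print(bad_denom.min(), good_denom.min())
```

If m itself is stored as uint8, the product mp is presumably computed in uint8 as well, so sums of squares above 255 could wrap there too; casting m to a wider dtype before taking the product may be the safer place to apply the fix.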