I am computing the cosine similarity for a matrix of vectors, and I get the result as a sparse matrix like this:
- (0, 26) 0.359171459261
- (0, 25) 0.121145761751
- (0, 24) 0.316922015914
- (0, 23) 0.157622038039
- (0, 22) 0.636466644041
- (0, 21) 0.136216495731
- (0, 20) 0.243164535496
- (0, 19) 0.348272617805
- (0, 18) 0.636466644041
- (0, 17) 1.0
But there are duplicates, for example:
(0, 24) 0.316922015914 and (24, 0) 0.316922015914
What I want is to remove them by index: if I have (0, 24) then I don't need (24, 0), because it holds the same value. I want to keep only one entry of each such pair and remove the other, for all vectors in the matrix. Currently I have the following code to create the matrix:
vectorized_words = sparse.csr_matrix(vectorize_words(nostopwords,glove_dict))
cos_similiarity = cosine_similarity(vectorized_words,dense_output=False)
So to summarize: I don't want to remove all duplicates, I want to be left with only one entry of each pair, in a Pythonic way.
Thank you in advance !
I think it is easiest to get the upper triangle of a coo format matrix.
First make a small symmetric matrix:
In [876]: A = sparse.random(5,5,.3,'csr')
In [877]: A = A+A.T
In [878]: A
Out[878]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 11 stored elements in Compressed Sparse Row format>
In [879]: A.A
Out[879]:
array([[ 0. , 0. , 0.81388978, 0. , 0. ],
[ 0. , 0. , 0.73944395, 0.20736975, 0.98968617],
[ 0.81388978, 0.73944395, 0. , 0. , 0. ],
[ 0. , 0.20736975, 0. , 0.05581152, 0.04448881],
[ 0. , 0.98968617, 0. , 0.04448881, 0. ]])
Convert to coo, and set the lower-triangle data values to 0:
In [880]: Ao = A.tocoo()
In [881]: mask = (Ao.row>Ao.col)
In [882]: mask
Out[882]:
array([False, False, False, False, True, True, True, False, False,
True, True], dtype=bool)
In [883]: Ao.data[mask]=0
Convert back to csr, and use eliminate_zeros to prune the matrix:
In [890]: A1 = Ao.tocsr()
In [891]: A1
Out[891]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 11 stored elements in Compressed Sparse Row format>
In [892]: A1.eliminate_zeros()
In [893]: A1
Out[893]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Row format>
In [894]: A1.A
Out[894]:
array([[ 0. , 0. , 0.81388978, 0. , 0. ],
[ 0. , 0. , 0.73944395, 0.20736975, 0.98968617],
[ 0. , 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0.05581152, 0.04448881],
[ 0. , 0. , 0. , 0. , 0. ]])
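The steps above can be collected into one small function. This is only a sketch: the name upper_triangle_csr is made up for illustration, and the random_state argument is there just to make the example repeatable.

```python
import numpy as np
from scipy import sparse

def upper_triangle_csr(A):
    """Return a CSR copy of A that keeps only entries with row <= col.

    Hypothetical helper: it just wraps the coo / mask / eliminate_zeros
    steps shown in the session above.
    """
    Ao = A.tocoo(copy=True)          # copy so the caller's matrix is untouched
    Ao.data[Ao.row > Ao.col] = 0     # zero out the lower triangle
    A1 = Ao.tocsr()
    A1.eliminate_zeros()             # in place: drop the stored zeros
    return A1

# Small symmetric test matrix, built the same way as in the session
A = sparse.random(5, 5, 0.3, format='csr', random_state=0)
A = A + A.T
A1 = upper_triangle_csr(A)
print(A1.toarray())
```

Note that the diagonal is kept here (row == col is not masked). For a cosine similarity matrix the diagonal is each vector's similarity with itself, so you may want `Ao.row >= Ao.col` in the mask instead.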
Both the coo and csr formats have an in-place eliminate_zeros method. The coo version:
def eliminate_zeros(self):
    """Remove zero entries from the matrix

    This is an *in place* operation
    """
    mask = self.data != 0
    self.data = self.data[mask]
    self.row = self.row[mask]
    self.col = self.col[mask]
Instead of using Ao.data[mask]=0, you could use this code as a model for eliminating just the lower-triangle values.
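Here is a rough sketch of that idea: filter the row, col and data arrays with a mask, the way eliminate_zeros does, but with the mask testing position instead of value. The function name drop_lower_triangle and its keep_diagonal flag are my own inventions for this example, and it builds a new matrix rather than modifying one in place.

```python
import numpy as np
from scipy import sparse

def drop_lower_triangle(A, keep_diagonal=True):
    """Keep one entry of each symmetric (i, j) / (j, i) pair.

    Hypothetical helper modeled on coo's eliminate_zeros: the coo arrays
    are filtered with a boolean mask, but the result is a new matrix.
    """
    Ao = A.tocoo()
    if keep_diagonal:
        mask = Ao.row <= Ao.col      # upper triangle plus diagonal
    else:
        mask = Ao.row < Ao.col       # strictly upper triangle
    kept = sparse.coo_matrix(
        (Ao.data[mask], (Ao.row[mask], Ao.col[mask])), shape=Ao.shape
    )
    return kept.tocsr()

# A tiny symmetric "similarity" matrix with ones on the diagonal
S = sparse.csr_matrix(np.array([[1.0, 0.5, 0.0],
                                [0.5, 1.0, 0.3],
                                [0.0, 0.3, 1.0]]))
U = drop_lower_triangle(S, keep_diagonal=False)
print(U)
```

With keep_diagonal=False only the (0, 1) and (1, 2) entries remain. scipy.sparse.triu(S, k=1) should give the same result if you prefer a built-in.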