Python sparse matrix remove duplicate indices except one?

I am computing the cosine similarity between the vectors in a matrix, and I get the result as a sparse matrix like this:

(0, 26) 0.359171459261
(0, 25) 0.121145761751
(0, 24) 0.316922015914
(0, 23) 0.157622038039
(0, 22) 0.636466644041
(0, 21) 0.136216495731
(0, 20) 0.243164535496
(0, 19) 0.348272617805
(0, 18) 0.636466644041
(0, 17) 1.0

But there are duplicates, for example:

(0, 24) 0.316922015914 and (24, 0) 0.316922015914

What I want is to remove these duplicates by index: if I have (0, 24), then I don't need (24, 0), because it holds the same value. I want to keep only one entry of each such pair and remove the other, for all vectors in the matrix. Currently I have the following code to create the matrix:

vectorized_words = sparse.csr_matrix(vectorize_words(nostopwords,glove_dict))
cos_similiarity = cosine_similarity(vectorized_words,dense_output=False)

So to summarize: I don't want to remove all duplicates; I want to keep exactly one entry of each pair, in a Pythonic way.

Thank you in advance !

Solution

I think it is easiest to get the upper-triangle of a coo format matrix:

First make a small symmetric matrix:

In [876]: A = sparse.random(5,5,.3,'csr')
In [877]: A = A+A.T
In [878]: A
Out[878]: 
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 11 stored elements in Compressed Sparse Row format>
In [879]: A.A
Out[879]: 
array([[ 0.        ,  0.        ,  0.81388978,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.73944395,  0.20736975,  0.98968617],
       [ 0.81388978,  0.73944395,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.20736975,  0.        ,  0.05581152,  0.04448881],
       [ 0.        ,  0.98968617,  0.        ,  0.04448881,  0.        ]])

Convert to coo, and set the lower-triangle data values to 0 (the mask row > col selects only entries strictly below the diagonal, so the diagonal is kept):

In [880]: Ao = A.tocoo()
In [881]: mask = (Ao.row>Ao.col)
In [882]: mask
Out[882]: 
array([False, False, False, False,  True,  True,  True, False, False,
        True,  True], dtype=bool)
In [883]: Ao.data[mask]=0

Convert back to csr, and use eliminate_zeros to prune the matrix:

In [890]: A1 = Ao.tocsr()
In [891]: A1
Out[891]: 
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 11 stored elements in Compressed Sparse Row format>
In [892]: A1.eliminate_zeros()
In [893]: A1
Out[893]: 
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>
In [894]: A1.A
Out[894]: 
array([[ 0.        ,  0.        ,  0.81388978,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.73944395,  0.20736975,  0.98968617],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.05581152,  0.04448881],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ]])
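The steps above can be collected into one script. This is only a sketch: the seed (random_state=0) is my own choice so that the run is repeatable, and the names A1 and A2 are just for illustration. It also compares the result with scipy.sparse.triu, which should return the same upper triangle in a single call.

```python
import numpy as np
from scipy import sparse

# Small symmetric matrix; the fixed seed is an arbitrary choice for repeatability.
A = sparse.random(5, 5, density=0.3, format="csr", random_state=0)
A = A + A.T

# Zero the entries strictly below the diagonal, then prune them.
Ao = A.tocoo()
Ao.data[Ao.row > Ao.col] = 0
A1 = Ao.tocsr()
A1.eliminate_zeros()

# One-call alternative: k=0 keeps the diagonal, k=1 would drop it as well.
A2 = sparse.triu(A, k=0, format="csr")

print(A1.nnz, A2.nnz)  # both count only entries with row <= col
```

toarray() is used in place of the .A shorthand when inspecting the result, since the shorthand is not available in every SciPy release.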

Both the coo and csr formats have an in-place eliminate_zeros method. The coo version (it works on the row and col arrays) is:

def eliminate_zeros(self):
    """Remove zero entries from the matrix

    This is an *in place* operation
    """
    mask = self.data != 0
    self.data = self.data[mask]
    self.row = self.row[mask]
    self.col = self.col[mask]

Instead of using Ao.data[mask]=0, you could use this code as a model for eliminating just the lower-triangle values.
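Here is a minimal sketch of what such a function might look like. The name drop_lower_triangle and the keep_diagonal flag are hypothetical, not part of scipy; the function builds a new coo matrix from the filtered row, col and data arrays rather than modifying the input in place.

```python
import numpy as np
from scipy import sparse

def drop_lower_triangle(M, keep_diagonal=True):
    """Return a COO copy of M without the entries below the diagonal.

    Hypothetical helper modeled on coo's eliminate_zeros: it filters the
    row, col and data arrays with one boolean mask.
    """
    M = sparse.coo_matrix(M)  # converts csr/csc/dense input to COO
    keep = (M.row <= M.col) if keep_diagonal else (M.row < M.col)
    return sparse.coo_matrix(
        (M.data[keep], (M.row[keep], M.col[keep])), shape=M.shape
    )

# Tiny symmetric example standing in for a similarity matrix.
S = sparse.csr_matrix(np.array([[1.0, 0.5, 0.0],
                                [0.5, 1.0, 0.2],
                                [0.0, 0.2, 1.0]]))
U = drop_lower_triangle(S)                        # upper triangle with diagonal
P = drop_lower_triangle(S, keep_diagonal=False)   # only the distinct pairs
print(U.nnz, P.nnz)  # → 5 2
```

For a cosine similarity matrix the diagonal is just each vector compared with itself, so keep_diagonal=False (or sparse.triu(cos_similiarity, k=1)) is probably what you want if only the distinct pairs matter.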