
Python sparse matrix remove duplicate indices except one?


I am computing the cosine similarity between the vectors in a matrix, and I get the result as a sparse matrix like this:

  • (0, 26) 0.359171459261
  • (0, 25) 0.121145761751
  • (0, 24) 0.316922015914
  • (0, 23) 0.157622038039
  • (0, 22) 0.636466644041
  • (0, 21) 0.136216495731
  • (0, 20) 0.243164535496
  • (0, 19) 0.348272617805
  • (0, 18) 0.636466644041
  • (0, 17) 1.0

But there are duplicates, for example:

(0, 24) 0.316922015914 and (24, 0) 0.316922015914

What I want to do is remove these by index: if I have (0, 24), then I don't need (24, 0) because it holds the same value. I want to keep only one entry of each such pair and remove the other, for all vectors in the matrix. Currently I have the following code to create the matrix:

from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

vectorized_words = sparse.csr_matrix(vectorize_words(nostopwords,glove_dict))
cos_similiarity = cosine_similarity(vectorized_words,dense_output=False)

So to summarize, I don't want to remove all duplicates; I want to keep exactly one entry of each pair, in a Pythonic way.

Thank you in advance !


Solution

  • I think it is easiest to take the upper triangle of the matrix in coo format:

    First make a small symmetric matrix:

    In [876]: A = sparse.random(5,5,.3,'csr')
    In [877]: A = A+A.T
    In [878]: A
    Out[878]: 
    <5x5 sparse matrix of type '<class 'numpy.float64'>'
        with 11 stored elements in Compressed Sparse Row format>
    In [879]: A.A
    Out[879]: 
    array([[ 0.        ,  0.        ,  0.81388978,  0.        ,  0.        ],
           [ 0.        ,  0.        ,  0.73944395,  0.20736975,  0.98968617],
           [ 0.81388978,  0.73944395,  0.        ,  0.        ,  0.        ],
           [ 0.        ,  0.20736975,  0.        ,  0.05581152,  0.04448881],
           [ 0.        ,  0.98968617,  0.        ,  0.04448881,  0.        ]])
    

    Convert to coo, and set the lower-triangle data values to 0:

    In [880]: Ao = A.tocoo()
    In [881]: mask = (Ao.row>Ao.col)
    In [882]: mask
    Out[882]: 
    array([False, False, False, False,  True,  True,  True, False, False,
            True,  True], dtype=bool)
    In [883]: Ao.data[mask]=0
    

    Convert back to csr, and use eliminate_zeros to prune the matrix:

    In [890]: A1 = Ao.tocsr()
    In [891]: A1
    Out[891]: 
    <5x5 sparse matrix of type '<class 'numpy.float64'>'
        with 11 stored elements in Compressed Sparse Row format>
    In [892]: A1.eliminate_zeros()
    In [893]: A1
    Out[893]: 
    <5x5 sparse matrix of type '<class 'numpy.float64'>'
        with 6 stored elements in Compressed Sparse Row format>
    In [894]: A1.A
    Out[894]: 
    array([[ 0.        ,  0.        ,  0.81388978,  0.        ,  0.        ],
           [ 0.        ,  0.        ,  0.73944395,  0.20736975,  0.98968617],
           [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
           [ 0.        ,  0.        ,  0.        ,  0.05581152,  0.04448881],
           [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ]])
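
    If the zero-and-prune steps feel roundabout, scipy.sparse also provides a triu function that returns the upper triangle directly. A minimal sketch, assuming a small hand-made symmetric matrix (dense here is illustrative, not the asker's data). Passing k=1 additionally drops the diagonal, which for cosine similarity is just each vector's similarity with itself:

```python
import numpy as np
from scipy import sparse

# Illustrative symmetric matrix standing in for a cosine-similarity result.
dense = np.array([
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.2],
    [0.5, 0.2, 1.0],
])
A = sparse.csr_matrix(dense)

# k=0 keeps the main diagonal and everything above it.
upper = sparse.triu(A, k=0, format="csr")

# k=1 keeps only the entries strictly above the diagonal.
strict_upper = sparse.triu(A, k=1, format="csr")

print(upper.nnz, strict_upper.nnz)
```

    Whether to keep the diagonal depends on whether the self-similarity values are needed downstream.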
    

    Both the coo and csr formats have an in-place eliminate_zeros method. The coo version is:


    def eliminate_zeros(self):
        """Remove zero entries from the matrix
    
        This is an *in place* operation
        """
        mask = self.data != 0
        self.data = self.data[mask]
        self.row = self.row[mask]
        self.col = self.col[mask]
    

    Instead of using Ao.data[mask]=0, you could use this code as a model for eliminating just the lower-triangle values.
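
    One way to follow that model, sketched under the assumption that building a new matrix is acceptable rather than editing in place (drop_lower_triangle is a hypothetical helper name, not a scipy function): keep only the entries with row <= col and construct a new coo matrix from them.

```python
import numpy as np
from scipy import sparse

def drop_lower_triangle(matrix):
    """Return a new CSR matrix holding only the entries with row <= col."""
    coo = matrix.tocoo()
    # True for entries on or above the diagonal.
    keep = coo.row <= coo.col
    trimmed = sparse.coo_matrix(
        (coo.data[keep], (coo.row[keep], coo.col[keep])),
        shape=coo.shape,
    )
    return trimmed.tocsr()

# Illustrative symmetric input, not the asker's data.
dense = np.array([
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.2],
    [0.5, 0.2, 1.0],
])
result = drop_lower_triangle(sparse.csr_matrix(dense))
print(result.toarray())
```

    Unlike the zero-and-prune route, this never stores the unwanted entries as explicit zeros, so no eliminate_zeros call is needed afterwards.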