Search code examples
pythonnumpyscipysparse-matrixindicator

Scipy: Sparse indicator matrix from array(s)


What is the most efficient way to compute a sparse boolean matrix I from one or two arrays a,b, with I[i,j]==True where a[i]==b[j]? The following is fast but memory-inefficient:

I = a[:,None]==b

The following is slow and still memory-inefficient during creation:

I = csr((a[:,None]==b),shape=(len(a),len(b)))

The following gives at least the rows,cols for better csr_matrix initialization, but it still creates the full dense matrix and is equally slow:

z = np.argwhere((a[:,None]==b))

Any ideas?


Solution

  • One way to do it would be to first identify all different elements that a and b have in common using sets. This should work well if there are not very many different possibilities for the values in a and b. One then would only have to loop over the different values (below in variable values) and use np.argwhere to identify the indices in a and b where these values occur. The 2D indices of the sparse matrix can then be constructed using np.repeat and np.tile:

    import numpy as np
    from scipy import sparse
    
    a = np.random.randint(0, 10, size=(400,))
    b = np.random.randint(0, 10, size=(300,))
    
    ## matrix generation after OP
    I1 = sparse.csr_matrix((a[:,None]==b),shape=(len(a),len(b)))
    
    ##identifying all values that occur both in a and b:
    values = set(np.unique(a)) & set(np.unique(b))
    
    ##here we collect the indices in a and b where the respective values are the same:
    rows, cols = [], []
    
    ##looping over the common values, finding their indices in a and b, and
    ##generating the 2D indices of the sparse matrix with np.repeat and np.tile
    for value in values:
        x = np.argwhere(a==value).ravel()
        y = np.argwhere(b==value).ravel()    
        rows.append(np.repeat(x, len(x)))
        cols.append(np.tile(y, len(y)))
    
    ##concatenating the indices for different values and generating a 1D vector
    ##of True values for final matrix generation
    rows = np.hstack(rows)
    cols = np.hstack(cols)
    data = np.ones(len(rows),dtype=bool)
    
    ##generating sparse matrix
    I3 = sparse.csr_matrix( (data,(rows,cols)), shape=(len(a),len(b)) )
    
    ##checking that the matrix was generated correctly:
    print((I1 != I3).nnz==0)
    

    The syntax for generating the csr matrix is taken from the documentation. The test for sparse matrix equality is taken from this post.

    Old Answer:

    I don't know about performance, but at least you can avoid constructing the full dense matrix by using a simple generator expression. Here some code that uses two 1d arras of random integers to first generate the sparse matrix the way that the OP posted and then uses a generator expression to test all elements for equality:

    import numpy as np
    from scipy import sparse
    
    a = np.random.randint(0, 10, size=(400,))
    b = np.random.randint(0, 10, size=(300,))
    
    ## matrix generation after OP
    I1 = sparse.csr_matrix((a[:,None]==b),shape=(len(a),len(b)))
    
    ## matrix generation using generator
    data, rows, cols = zip(
        *((True, i, j) for i,A in enumerate(a) for j,B in enumerate(b) if A==B)
    )
    I2 = sparse.csr_matrix((data, (rows, cols)), shape=(len(a), len(b)))
    
    ##testing that matrices are equal
    ## from https://stackoverflow.com/a/30685839/2454357
    print((I1 != I2).nnz==0)  ## --> True
    

    I think there is no way around the double loop and ideally this would be pushed into numpy, but at least with the generator the loops are somewhat optimised ...