Search code examples
pythonperformancescipysparse-matrix

Faster way to build a matrix of expected frequencies from a sparse matrix of counts


I have a Compressed Sparse Row matrix containing counts. I want to build a matrix containing the expected frequencies for these counts. Here's the code I'm currently using:

from scipy.sparse import coo_matrix

#m is a csr_matrix

col_total = m.sum(axis=0)
row_total = m.sum(axis=1)
n = int(col_total.sum(axis=1))
A = coo_matrix(m)

for i,j in zip(A.row,A.col):
    m[i,j]= col_total.item(j)*row_total.item(i)/n

This works fine on a small matrix. On a bigger matrix (>1Gb), the for loop takes several days to run. Is there any way I could make this faster?


Solution

  • To expand a little on @hpaulj's answer, you can get rid of the for loop by creating the output matrix directly from the expected frequencies and the row/column indices of the non-zero elements in m:

    from scipy import sparse
    import numpy as np
    
    def fast_efreqs(m):
    
        col_total = np.array(m.sum(axis=0)).ravel()
        row_total = np.array(m.sum(axis=1)).ravel()
    
        # I'm casting this to an int for consistency with your version, but it's
        # not clear to me why you would want to do this...
        grand_total = int(col_total.sum())
    
        ridx, cidx = m.nonzero()            # indices of non-zero elements in m
        efreqs = row_total[ridx] * col_total[cidx] / grand_total
    
        return sparse.coo_matrix((efreqs, (ridx, cidx)))
    

    For comparison, here's your original code as a function:

    def orig_efreqs(m):
    
        col_total = m.sum(axis=0)
        row_total = m.sum(axis=1)
        n = int(col_total.sum(axis=1))
    
        A = sparse.coo_matrix(m)
        for i,j in zip(A.row,A.col):
            m[i,j]= col_total.item(j)*row_total.item(i)/n
    
        return m
    

    Test equivalence on a small matrix:

    m = sparse.rand(100, 100, density=0.1, format='csr')
    print((orig_efreqs(m.copy()) != fast_efreqs(m)).nnz == 0)
    # True
    

    Benchmark performance on a larger matrix:

    In [1]: %%timeit m = sparse.rand(10000, 10000, density=0.01, format='csr')
       .....: orig_efreqs(m)
       .....: 
    1 loops, best of 3: 2min 25s per loop
    
    In [2]: %%timeit m = sparse.rand(10000, 10000, density=0.01, format='csr')
       .....: fast_efreqs(m)
       .....: 
    10 loops, best of 3: 38.3 ms per loop