Tags: python, numpy, scipy, sparse-matrix

Efficiently Subtract Vector from Matrix (Scipy)


I've got a large matrix stored as a scipy.sparse.csc_matrix and want to subtract a column vector from each one of the columns in the large matrix. This is a pretty common task when you're doing things like normalization/standardization, but I can't seem to find the proper way to do this efficiently.

Here's an example to demonstrate:

# mat is a 3x3 matrix
mat = scipy.sparse.csc_matrix([[1, 2, 3],
                               [2, 3, 4],
                               [3, 4, 5]])

# vec is a 3x1 matrix (or a column vector)
vec = scipy.sparse.csc_matrix([1,2,3]).T

""" 
I want to subtract `vec` from each of the columns in `mat` yielding...
    [[0, 1, 2],
     [0, 1, 2],
     [0, 1, 2]]
"""

One way to accomplish what I want is to hstack vec to itself 3 times, yielding a 3x3 matrix where each column is vec and then subtract that from mat. But again, I'm looking for a way to do this efficiently, and the hstacked matrix takes a long time to create. I'm sure there's some magical way to do this with slicing and broadcasting, but it eludes me.

Thanks!

EDIT: Removed the 'in-place' constraint, because sparsity structure would be constantly changing in an in-place assignment scenario.


Solution

  • To start, what would we do with dense arrays?

    mat-vec.A # taking advantage of broadcasting
    mat-vec.A[:,[0]*3] # explicit broadcasting
    mat-vec[:,[0,0,0]] # that also works with csr matrix
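
As a quick check, here is a minimal sketch that runs the three variants above on the matrix from the question and compares them with the expected output. It writes toarray() where the snippets use .A, and the variable names (dense_vec, result_broadcast, and so on) are only illustrative.

```python
import numpy as np
import scipy.sparse as sp

mat = sp.csc_matrix([[1, 2, 3],
                     [2, 3, 4],
                     [3, 4, 5]])
vec = sp.csc_matrix([1, 2, 3]).T          # 3x1 sparse column vector

dense_vec = vec.toarray()                 # dense array of shape (3, 1)

# 1. sparse minus dense column: broadcasting across columns, dense result
result_broadcast = np.asarray(mat - dense_vec)

# 2. the same thing with the column repeated explicitly
result_explicit = np.asarray(mat - dense_vec[:, [0] * 3])

# 3. repeat the sparse column by indexing; the result stays sparse
result_sparse = (mat - vec[:, [0, 0, 0]]).toarray()

expected = np.array([[0, 1, 2],
                     [0, 1, 2],
                     [0, 1, 2]])
print(result_broadcast)
```

Only the third variant keeps the result sparse. The first two return a dense result, which may not fit in memory for a large matrix.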
    

    In https://codereview.stackexchange.com/questions/32664/numpy-scipy-optimization/33566 we found that using as_strided on the mat.indptr vector is the most efficient way of stepping through the rows of a sparse matrix. (The x.rows and x.data lists of an lil_matrix are nearly as good; getrow is slow.) Note that indptr delimits rows only for a csr_matrix. For the csc_matrix in the question it delimits columns, so convert with mat.tocsr() first. This function implements such an iteration.

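    from numpy.lib.stride_tricks import as_strided

    def subtract_rowwise(X, v):
        # X must be CSR: indptr[i]:indptr[i+1] bounds the stored values of row i
        rows, cols = X.shape
        row_start_stop = as_strided(X.indptr, shape=(rows, 2),
                                strides=2*X.indptr.strides)
        for row, (start, stop) in enumerate(row_start_stop):
            data = X.data[start:stop]  # a view into X.data, so X is updated in place
            data -= v[row]

    mat = mat.tocsr()  # the question's matrix is CSC; this iteration needs CSR
    subtract_rowwise(mat, vec.A)
    print(mat.A)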
    

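    I'm using vec.A for simplicity. If we keep vec sparse we'd have to add a test for a nonzero value at each row. Also, this type of iteration only modifies the stored (nonzero) elements of mat. Implicit zeros are left unchanged, so the result agrees with mat-vec.A only at positions where mat has a stored value.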
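
To make that caveat concrete, here is a small sketch comparing a stored-entries-only update with a full dense subtraction. It swaps the as_strided loop for an np.repeat over the row lengths, which should touch the same entries. The matrix X and vector v are made-up example data, not taken from the question.

```python
import numpy as np
import scipy.sparse as sp

# Made-up CSR matrix with implicit zeros, and a dense vector with one entry per row
X = sp.csr_matrix(np.array([[1.0, 0.0, 3.0],
                            [0.0, 0.0, 4.0]]))
v = np.array([1.0, 2.0])

# Number of stored entries in each row, from consecutive indptr differences
row_lengths = np.diff(X.indptr)

# Subtract v[i] from every stored entry of row i (implicit zeros are not touched)
Y = X.copy()
Y.data -= np.repeat(v, row_lengths)

stored_only = Y.toarray()               # [[0, 0, 2], [0, 0, 2]]
full = X.toarray() - v[:, None]         # [[0, -1, 2], [-2, -2, 2]]

# Positions where the two results disagree are exactly the implicit zeros of X
differs = stored_only != full
print(differs)
```

Whether the stored-only behaviour is acceptable depends on the application. For standardization, the implicit zeros usually need to be shifted too, which makes the result dense.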

    I suspect the time advantages will depend a lot on the sparsity of the matrix and the vector. If vec has lots of zeros, then it makes sense to iterate, modifying only those rows of mat where vec is nonzero. But if vec is nearly dense, as in this example, it may be hard to beat mat-vec.A.
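
For the sparse-vec case, one possible sketch is below. It assumes vec is an n x 1 column in canonical CSC form, so that vec.indices lists the rows holding nonzero values and vec.data holds those values. The helper name subtract_sparse_column and the example data are hypothetical, and as before only stored entries of the matrix are modified.

```python
import numpy as np
import scipy.sparse as sp

def subtract_sparse_column(X, vec):
    """Hypothetical helper: subtract vec[i] from the stored entries of row i,
    visiting only the rows where the sparse column vec is nonzero.
    Returns a new CSR matrix; implicit zeros of X are left untouched."""
    out = X.tocsr().copy()      # CSR so that indptr delimits rows
    vec = vec.tocsc()           # single column: indices are row numbers
    for row, value in zip(vec.indices, vec.data):
        start, stop = out.indptr[row], out.indptr[row + 1]
        out.data[start:stop] -= value
    return out

# Made-up float data; vec is nonzero only in row 1
X = sp.csr_matrix(np.array([[1.0, 0.0, 3.0],
                            [2.0, 3.0, 0.0],
                            [0.0, 4.0, 5.0]]))
vec = sp.csc_matrix(np.array([[0.0], [2.0], [0.0]]))

result = subtract_sparse_column(X, vec)
print(result.toarray())
```

The loop runs once per nonzero of vec rather than once per row of the matrix, which is where any saving would come from when vec is very sparse.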