Search code examples
pythonfor-loopnumpyscipyblas

Many small matrices speed-up for loops


I have a large coordinate grid (vectors a and b), for which I generate and solve a matrix (10x10) equation. Is there a way for scipy.linalg.solve to accept vector input? So far my solution was to run for cycles over the coordinate arrays.

import time
import numpy as np
import scipy.linalg

N = 10
a = np.linspace(0, 1, 10**3)
b = np.linspace(0, 1, 2*10**3)
A = np.random.random((N, N))      # input matrix, not static

def f(a,b,n):     # array-filling function
   return a*b*n

def sol(x,y):   # matrix solver
   D = np.arange(0,N)
   B = f(x,y,D)**2 + f(x-1, y+1, D)      # source vector
   X = scipy.linalg.solve(A,B)
   return X    # output an N-size vector

start = time.time()

answer = np.zeros(shape=(a.size, b.size)) # predefine output array

for egg in range(a.size):             # an ugly double-for cycle
   for ham in range(b.size):
       aa = a[egg]
       bb = b[ham]
       answer[egg,ham] = sol(aa,bb)[0]

print time.time() - start

Solution

  • @yevgeniy You are right, efficiently solving multiple independent linear systems A x = b with scipy a bit tricky (assuming an A array that changes for every iteration).

    For instance, here is a benchmark for solving 1000 systems of the form, A x = b, where A is a 10x10 matrix, and b a 10 element vector. Surprisingly, the approach to put all this into one block diagonal matrix and call scipy.linalg.solve once is indeed slower both with dense and sparse matrices.

    import numpy as np
    from scipy.linalg import block_diag, solve
    from scipy.sparse import block_diag as sp_block_diag
    from scipy.sparse.linalg import spsolve
    
    N = 10
    M = 1000 # number of coordinates 
    Ai = np.random.randn(N, N) # we can compute the inverse here,
    # but let's assume that Ai are different matrices in the for loop loop
    bi = np.random.randn(N)
    
    %timeit [solve(Ai, bi) for el in range(M)]
    # 10 loops, best of 3: 32.1 ms per loop
    
    Afull = sp_block_diag([Ai]*M, format='csr')
    bfull = np.tile(bi, M)
    
    %timeit Afull = sp_block_diag([Ai]*M, format='csr')
    %timeit spsolve(Afull, bfull)
    
    # 1 loops, best of 3: 303 ms per loop
    # 100 loops, best of 3: 5.55 ms per loop
    
    Afull = block_diag(*[Ai]*M) 
    
    %timeit Afull = block_diag(*[Ai]*M)
    %timeit solve(Afull, bfull)
    
    # 100 loops, best of 3: 14.1 ms per loop
    # 1 loops, best of 3: 23.6 s per loop
    

    The solution of the linear system, with sparse arrays is faster, but the time to create this block diagonal array is actually very slow. As to dense arrays, they are simply slower in this case (and take lots of RAM).

    Maybe I'm missing something about how to make this work efficiently with sparse arrays, but if you are keeping the for loops, there are two things that you could do for optimizations.

    From pure python, look at the source code of scipy.linalg.solve : remove unnecessary tests and factorize all repeated operations outside of your loops. For instance, assuming your arrays are not symmetrical positives, we could do

    from scipy.linalg import get_lapack_funcs
    
    gesv, = get_lapack_funcs(('gesv',), (Ai, bi))
    
    def solve_opt(A, b, gesv=gesv):
        # not sure if copying A and B is necessary, but just in case (faster if arrays are not copied)
        lu, piv, x, info = gesv(A.copy(), b.copy(), overwrite_a=False, overwrite_b=False)
        if info == 0:
            return x
        if info > 0:
            raise LinAlgError("singular matrix")
        raise ValueError('illegal value in %d-th argument of internal gesv|posv' % -info)
    
    %timeit [solve(Ai, bi) for el in range(M)]
    %timeit [solve_opt(Ai, bi) for el in range(M)]
    
    # 10 loops, best of 3: 30.1 ms per loop
    # 100 loops, best of 3: 3.77 ms per loop
    

    which results in a 6.5x speed up.

    If you need even better performance, you would have to port this for loop in Cython and interface the gesv BLAS functions directly in C, as discussed here, or better with the Cython API for BLAS/LAPACK in Scipy 0.16.

    Edit: As @Eelco Hoogendoorn mentioned if your A matrix is fixed, there is a much simpler and more efficient approach.