Tags: numpy, optimization, pytorch, scipy, least-squares

Fastest way to compute least-squares with relatively sparse matrix (0.4% of data)


Context: I solve the equation A_des @ x = b, where I find x through a least-squares method. I have to do this operation several thousand times. A_des changes at each iteration but remains sparse (always around 0.4% of the matrix entries hold actual data).

Problem: A_des is large (shape (28106, 1185)), and solving the linear system takes approximately 1 second. I need to do this operation 300,000 times.

Question: Do you have any leads on what could speed up this process?

What I tried: I implemented my function on both GPU and CPU. I use PyTorch for the GPU computations, but the speed-up is inconsistent. I cannot parallelize my function myself because the solver functions are usually already multithreaded. I have tried several solver functions on CPU: np.linalg.solve, np.linalg.lstsq, scipy.linalg.solve, scipy.optimize.least_squares, and scipy.sparse.linalg.lsqr (after first converting A_des with scipy.sparse.coo_array(A_des)).

Best method I found: The best compromise I found was np.linalg.solve applied to the normal equations, which only requires some matrix manipulation by hand. Solving the system with a least-squares function would be x = np.linalg.lstsq(A_des, b), while with solve it becomes x = np.linalg.solve(A_des.T @ A_des, A_des.T @ b). The only other option I found that solves the system faster is x = scipy.sparse.linalg.lsqr(A_des, b, atol=1e-3, btol=1e-3), but then the results are not accurate enough.
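
(As an aside, since A_des is mostly zeros, one possible refinement of this same normal-equations trick is to form the small Gram matrix with scipy.sparse, so only the nonzeros participate in the expensive product. The sketch below is untested and assumes A_des has full column rank, so that A_des.T @ A_des is symmetric positive definite; the function name is invented:)

import numpy as np
import scipy.linalg
import scipy.sparse

def sparse_normal_solve(A_des, b):
    # Sparse Gram product: only the nonzeros of A_des take part
    A_sp = scipy.sparse.csr_array(A_des)
    G = (A_sp.T @ A_sp).toarray()   # (1185, 1185): small and dense
    rhs = A_sp.T @ b                # (1185,)
    # assume_a='pos' lets LAPACK use a Cholesky factorization,
    # valid when A_des has full column rank
    return scipy.linalg.solve(G, rhs, assume_a='pos')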

My function:

import numpy as np
import torch

device = torch.device("cuda")  # GPU target used by the torch path

def Inverter(GPU, A_des, b):
    if GPU:
        # Migrate b and the design matrix to the GPU as float64
        b = torch.from_numpy(b).to(device).double()
        A_des = torch.from_numpy(A_des).to(device).double()
        # Solve the normal equations on the GPU, then bring x back
        # to a NumPy array so both branches return the same type
        x = torch.linalg.solve(A_des.T @ A_des, A_des.T @ b).cpu().numpy()
    else:
        # Other CPU variants tried:
        # A_sp = scipy.sparse.coo_array(A_des)
        # x = scipy.sparse.linalg.lsqr(A_sp, b)[0]
        # x = scipy.linalg.solve(A_des.T @ A_des, A_des.T @ b)
        x = np.linalg.solve(A_des.T @ A_des, A_des.T @ b)

    return x
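
(A minimal smoke test of the CPU path, with random dense data standing in for the real design matrix; the shapes come from the question, the data here is invented:)

rng = np.random.default_rng(0)
A_des = rng.standard_normal((28106, 1185))
b = rng.standard_normal(28106)
x = Inverter(False, A_des, b)   # x.shape == (1185,)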

Solution

  • You can try using the Numba library to parallelize functions, including on the GPU:

    https://numba.readthedocs.io/en/stable/user/jit.html
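
As a rough sketch of that idea (not the poster's code: the function name and the batch layout are assumptions), Numba's prange can distribute independent solves across cores. Numba supports np.linalg.solve in nopython mode on contiguous float arrays; since the poster noted the solvers are already multithreaded, the BLAS/LAPACK threads should be pinned to one per worker (e.g. OMP_NUM_THREADS=1) to avoid oversubscription, and the 300,000 systems would be fed in chunks to keep memory bounded:

import numpy as np
from numba import njit, prange

@njit(parallel=True, cache=True)
def batch_normal_solve(A_batch, b_batch):
    # A_batch: (n_sys, 28106, 1185), b_batch: (n_sys, 28106), both float64
    n_sys = A_batch.shape[0]
    n_par = A_batch.shape[2]
    out = np.empty((n_sys, n_par))
    for i in prange(n_sys):
        # Numba's linear-algebra routines need contiguous inputs
        A = np.ascontiguousarray(A_batch[i])
        At = np.ascontiguousarray(A.T)
        rhs = (At @ b_batch[i]).reshape(n_par, 1)
        # One independent normal-equations solve per worker
        out[i] = np.linalg.solve(At @ A, rhs)[:, 0]
    return out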