
Speed up least_squares in SciPy


How can I speed up this least_squares call? Six variables (3 orientation angles and 3 translation components) need to be optimized. The function receives two sets of 3D points, two sets of image-plane points, and a projection matrix as inputs.

import numpy as np
from scipy.optimize import least_squares

dSeed = np.zeros(6)
optRes = least_squares(minReproj, dSeed, method='lm', max_nfev=600,
    args=(points_prev, points_cur, d3d_prev, d3d_cur, Proj1))

This function minimizes the error of forward-backward projection of points.

from math import sin, cos
import numpy as np

def minReproj(dof, d2dPoints1, d2dPoints2, d3dPoints1, d3dPoints2, w2cMatrix):
    Rmat = genEulerZXZ(dof[0], dof[1], dof[2]) # my function
    translationArray = np.array([[dof[3]], [dof[4]], [dof[5]]])
    temp = np.hstack((Rmat, translationArray))
    perspectiveProj = np.vstack((temp, [0, 0, 0, 1]))

    numPoints = d2dPoints1.shape[0]
    errorA = np.zeros((numPoints,3))
    errorB = np.zeros((numPoints,3))

    forwardProj = np.matmul(w2cMatrix, perspectiveProj)
    backwardProj = np.matmul(w2cMatrix, np.linalg.inv(perspectiveProj))

    for i in range(numPoints):
        Ja = np.ones((3))
        Jb = np.ones((3))
        Wa = np.ones((4))
        Wb = np.ones((4))

        Ja[0:2] = d2dPoints1[i,:]
        Jb[0:2] = d2dPoints2[i,:]
        Wa[0:3] = d3dPoints1[i,:]
        Wb[0:3] = d3dPoints2[i,:]

        JaPred = np.matmul(forwardProj, Wb)
        JaPred /= JaPred[-1]
        e1 = Ja - JaPred

        JbPred = np.matmul(backwardProj, Wa)
        JbPred /= JbPred[-1]
        e2 = Jb - JbPred

        errorA[i,:] = e1
        errorB[i,:] = e2

    residual = np.vstack((errorA,errorB))
    return residual.flatten()


def genEulerZXZ(psi, theta, sigma):
    c1 = cos(psi)
    s1 = sin(psi)
    c2 = cos(theta)
    s2 = sin(theta)
    c3 = cos(sigma)
    s3 = sin(sigma)

    mat = np.zeros((3,3))

    mat[0,0] = (c1 * c3) - (s1 * c2 * s3)
    mat[0,1] = (-c1 * s3) - (s1 * c2 * c3)
    mat[0,2] = (s1 * s2)

    mat[1,0] = (s1 * c3) + (c1 * c2 * s3)
    mat[1,1] = (-s1 * s3) + (c1 * c2 * c3)
    mat[1,2] = (-c1 * s2)

    mat[2,0] = (s2 * s3)
    mat[2,1] = (s2 * c3)
    mat[2,2] = c2

    return mat

This optimization takes 0.2 to 0.4 seconds, which is too slow for my application. Do you know how to speed up this process? Or is there another way to find the relative rotation and translation between two point sets? For rpoleski, here is the profiling output:

       96    0.023    0.000   19.406    0.202 /usr/local/lib/python3.7/dist-packages/scipy/optimize/_lsq/least_squares.py:240(least_squares)
     4548    0.132    0.000   18.986    0.004 /usr/local/lib/python3.7/dist-packages/scipy/optimize/_lsq/least_squares.py:801(fun_wrapped)
       96    0.012    0.000   18.797    0.196 /usr/local/lib/python3.7/dist-packages/scipy/optimize/_lsq/least_squares.py:42(call_minpack)
     4548   11.102    0.002   18.789    0.004 /home/pi/helperFunctions.py:29(minimizeReprojection)

Solution

  • Most likely, the largest portion of the time inside scipy.optimize.least_squares() is spent computing the residuals, which in your case means running the code in minReproj().

    However, the code you provided in minReproj() has sub-optimal memory management, which can be greatly improved with pre-allocation. This results in significant speed gains.


    For example, vstack() and hstack() pay a significant penalty for copying their arguments into the memory of the final result. Consider the first few lines of minReproj(), which I regrouped in gen_affine_OP(). They can be rewritten as gen_affine() with much better timings (I have also rewritten gen_euler_zxz() so that it does not allocate new memory):

    import numpy as np
    from math import sin, cos
    
    
    def gen_euler_zxz(result, psi, theta, sigma):
        c1 = cos(psi)
        s1 = sin(psi)
        c2 = cos(theta)
        s2 = sin(theta)
        c3 = cos(sigma)
        s3 = sin(sigma)
        result[0,0] = (c1 * c3) - (s1 * c2 * s3)
        result[0,1] = (-c1 * s3) - (s1 * c2 * c3)
        result[0,2] = (s1 * s2)
        result[1,0] = (s1 * c3) + (c1 * c2 * s3)
        result[1,1] = (-s1 * s3) + (c1 * c2 * c3)
        result[1,2] = (-c1 * s2)
        result[2,0] = (s2 * s3)
        result[2,1] = (s2 * c3)
        result[2,2] = c2
        return result
    
    
    def gen_affine(dof):
        result = np.zeros((4, 4))
        gen_euler_zxz(result[:3, :3], dof[0], dof[1], dof[2])
        result[:3, 3] = dof[3:]
        result[3, 3] = 1
        return result
    
    
    def gen_affine_OP(dof):
        Rmat = gen_euler_zxz(np.empty((3, 3)), dof[0], dof[1], dof[2])
        translationArray = np.array([[dof[3]], [dof[4]], [dof[5]]])
        temp = np.hstack((Rmat, translationArray))
        return np.vstack((temp, [0, 0, 0, 1]))
    
    
    dof = 1, 2, 3, 4, 5, 6
    %timeit gen_affine_OP(dof)
    # 100000 loops, best of 3: 16.6 µs per loop
    %timeit gen_affine(dof)
    # 100000 loops, best of 3: 5.02 µs per loop
    

    Similarly, the residual = np.vstack((errorA, errorB)) call can be avoided by pre-allocating a larger residual array and using views into it in place of errorA and errorB.
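
    A minimal sketch of this view trick (numPoints and the array names are placeholders mirroring the original code):

```python
import numpy as np

numPoints = 4
# Allocate the stacked result once...
residual = np.empty((2 * numPoints, 3))
# ...and let errorA / errorB be views into its two halves.
errorA = residual[:numPoints]
errorB = residual[numPoints:]

errorA[:] = 1.0  # writes land directly in residual
errorB[:] = 2.0
```

    Since errorA and errorB share memory with residual, filling them fills residual in place, and the np.vstack() call disappears entirely.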

    Another example is when creating Ja, Jb, Wa, Wb:

    def func_OP(numPoints):
        for i in range(numPoints):
            Ja = np.ones((3))
            Jb = np.ones((3))
            Wa = np.ones((4))
            Wb = np.ones((4))
    
    
    def func(n):
        Ja = np.empty(3)
        Jb = np.empty(3)
        Wa = np.empty(3)
        Wb = np.empty(3)
        for i in range(n):
            Ja[:] = 1
            Jb[:] = 1
            Wa[:] = 1
            Wb[:] = 1
    
    
    %timeit func_OP(1000)
    # 100 loops, best of 3: 9.31 ms per loop
    %timeit func(1000)
    # 100 loops, best of 3: 2.2 ms per loop
    

    Also, .flatten() makes a copy you do not need, while .ravel() just returns a view; that is enough for your purposes and is much faster:

    a = np.ones((300, 300))
    %timeit a.ravel()
    # 1000000 loops, best of 3: 215 ns per loop
    %timeit a.flatten()
    # 10000 loops, best of 3: 113 µs per loop
    

    The final remark concerns speeding up your main loop. I do not see a simple vectorized rewriting for it, but you can speed things up with Numba (as long as the code compiles in non-object mode).

    To do so, you also need to JIT-decorate gen_euler_zxz(), and you need to replace the np.matmul() calls with np.dot() (because np.matmul() is not supported by Numba).
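
    This substitution is safe here because, for the 2-D × 2-D and 2-D × 1-D products that minReproj() performs, np.dot() and np.matmul() give identical results (they only diverge for inputs with more than two dimensions). A quick sanity check with stand-in arrays of the same shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
w2c = rng.random((3, 4))   # stand-in for w2cMatrix
proj = rng.random((4, 4))  # stand-in for perspectiveProj
vec = rng.random(4)        # stand-in for a homogeneous point

# 2-D @ 2-D product, as in matmul(w2cMatrix, perspectiveProj)
assert np.allclose(np.dot(w2c, proj), np.matmul(w2c, proj))
# 2-D @ 1-D product, as in matmul(forwardProj, Wb)
assert np.allclose(np.dot(w2c, vec), np.matmul(w2c, vec))
```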


    Eventually, the revised minReproj() would look like:

    from math import sin, cos
    import numpy as np
    import numba as nb
    
    
    matmul = np.dot
    
    
    @nb.jit
    def gen_euler_zxz(result, psi, theta, sigma):
        c1 = cos(psi)
        s1 = sin(psi)
        c2 = cos(theta)
        s2 = sin(theta)
        c3 = cos(sigma)
        s3 = sin(sigma)
        result[0, 0] = (c1 * c3) - (s1 * c2 * s3)
        result[0, 1] = (-c1 * s3) - (s1 * c2 * c3)
        result[0, 2] = (s1 * s2)
        result[1, 0] = (s1 * c3) + (c1 * c2 * s3)
        result[1, 1] = (-s1 * s3) + (c1 * c2 * c3)
        result[1, 2] = (-c1 * s2)
        result[2, 0] = (s2 * s3)
        result[2, 1] = (s2 * c3)
        result[2, 2] = c2
        return result
    
    
    @nb.jit
    def minReproj(dof, d2dPoints1, d2dPoints2, d3dPoints1, d3dPoints2, w2cMatrix):
        perspectiveProj = np.zeros((4, 4))
        gen_euler_zxz(perspectiveProj[:3, :3], dof[0], dof[1], dof[2])
        perspectiveProj[:3, 3] = dof[3:]
        perspectiveProj[3, 3] = 1
    
        numPoints = d2dPoints1.shape[0]
        residual = np.empty((2 * numPoints, 3))
    
        forwardProj = matmul(w2cMatrix, perspectiveProj)
        backwardProj = matmul(w2cMatrix, np.linalg.inv(perspectiveProj))
    
        Ja = np.empty(3)
        Jb = np.empty(3)
        Wa = np.empty(4)
        Wb = np.empty(4)
        for i in range(numPoints):
            j = i + numPoints
    
            Ja[2] = Jb[2] = 1.0
            Wa[3] = Wb[3] = 1.0
            Ja[0:2] = d2dPoints1[i, :]
            Jb[0:2] = d2dPoints2[i, :]
            Wa[0:3] = d3dPoints1[i, :]
            Wb[0:3] = d3dPoints2[i, :]
    
            JaPred = matmul(forwardProj, Wb)
            JaPred /= JaPred[-1]
            residual[i, :] = Ja - JaPred
    
            JbPred = matmul(backwardProj, Wa)
            JbPred /= JbPred[-1]
            residual[j, :] = Jb - JbPred
        return residual.ravel()
    

    Testing on some toy data (with minReproj_OP being your original implementation, minReproj the pre-allocating rewrite without the @nb.jit decorators, and minReproj_nb the Numba-compiled version) shows significant speed-ups, which become dramatic when using Numba appropriately:

    n = 10000
    dof = 1, 2, 3, 4, 5, 6
    d2dPoints1 = np.random.random((n, 2))
    d2dPoints2 = np.random.random((n, 2))
    d3dPoints1 = np.random.random((n, 3))
    d3dPoints2 = np.random.random((n, 3))
    w2cMatrix = np.random.random((3, 4))
    x = minReproj_OP(dof, d2dPoints1, d2dPoints2, d3dPoints1, d3dPoints2, w2cMatrix)
    y = minReproj(dof, d2dPoints1, d2dPoints2, d3dPoints1, d3dPoints2, w2cMatrix)
    z = minReproj_nb(dof, d2dPoints1, d2dPoints2, d3dPoints1, d3dPoints2, w2cMatrix)
    print(np.allclose(x, y))
    # True
    print(np.allclose(x, z))
    # True
    
    
    %timeit minReproj_OP(dof, d2dPoints1, d2dPoints2, d3dPoints1, d3dPoints2, w2cMatrix)
    # 1 loop, best of 3: 277 ms per loop
    %timeit minReproj(dof, d2dPoints1, d2dPoints2, d3dPoints1, d3dPoints2, w2cMatrix)
    # 10 loops, best of 3: 177 ms per loop
    %timeit minReproj_nb(dof, d2dPoints1, d2dPoints2, d3dPoints1, d3dPoints2, w2cMatrix)
    # 100 loops, best of 3: 5.69 ms per loop
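
    As for the question's aside about other ways to find the relative rotation and translation of two point sets: if a plain rigid alignment between the two 3-D point sets is acceptable (i.e. without the forward/backward reprojection residuals), the Kabsch algorithm gives a closed-form answer via a single SVD, with no iterative optimizer at all. A minimal sketch (rigid_align() is a hypothetical helper, not part of the code above):

```python
import numpy as np


def rigid_align(p, q):
    """Return R, t minimizing sum ||R @ p_i + t - q_i||^2 (Kabsch algorithm)."""
    p_mean = p.mean(axis=0)
    q_mean = q.mean(axis=0)
    h = (p - p_mean).T @ (q - q_mean)    # 3x3 cross-covariance of centered sets
    u, _, vt = np.linalg.svd(h)
    # Flip the last axis if the unconstrained optimum would be a reflection.
    d = np.sign(np.linalg.det(vt.T @ u.T))
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    t = q_mean - r @ p_mean
    return r, t


# Recover a known transform from synthetic, noise-free correspondences.
rng = np.random.default_rng(1)
r_true, _ = np.linalg.qr(rng.standard_normal((3, 3)))
if np.linalg.det(r_true) < 0:
    r_true[:, 0] *= -1.0                 # ensure a proper rotation (det = +1)
t_true = np.array([1.0, -2.0, 0.5])
p = rng.standard_normal((50, 3))
q = p @ r_true.T + t_true                # q_i = R @ p_i + t

r_est, t_est = rigid_align(p, q)
```

    With noisy correspondences the same formula yields the least-squares rigid transform. It does not replace the reprojection-based cost above, but it is a cheap way to compute a good dSeed to start least_squares from instead of np.zeros(6).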