
Speed up least_squares in SciPy


How can I speed up this least_squares call? Six variables (3 orientation angles and 3 translation components) need to be optimized. The function receives two sets of 3D points, two sets of image-plane points, and a projection matrix as inputs.

import numpy as np
from scipy.optimize import least_squares

dSeed = np.zeros(6)
optRes = least_squares(minReproj, dSeed, method='lm', max_nfev=600,
    args=(points_prev, points_cur, d3d_prev, d3d_cur, Proj1))

This function minimizes the error of forward-backward projection of points.

from math import sin, cos
import numpy as np

def minReproj(dof, d2dPoints1, d2dPoints2, d3dPoints1, d3dPoints2, w2cMatrix):
    Rmat = genEulerZXZ(dof[0], dof[1], dof[2]) # my function
    translationArray = np.array([[dof[3]], [dof[4]], [dof[5]]])
    temp = np.hstack((Rmat, translationArray))
    perspectiveProj = np.vstack((temp, [0, 0, 0, 1]))

    numPoints = d2dPoints1.shape[0]
    errorA = np.zeros((numPoints,3))
    errorB = np.zeros((numPoints,3))

    forwardProj = np.matmul(w2cMatrix, perspectiveProj)
    backwardProj = np.matmul(w2cMatrix, np.linalg.inv(perspectiveProj))

    for i in range(numPoints):
        Ja = np.ones((3))
        Jb = np.ones((3))
        Wa = np.ones((4))
        Wb = np.ones((4))

        Ja[0:2] = d2dPoints1[i,:]
        Jb[0:2] = d2dPoints2[i,:]
        Wa[0:3] = d3dPoints1[i,:]
        Wb[0:3] = d3dPoints2[i,:]

        JaPred = np.matmul(forwardProj, Wb)
        JaPred /= JaPred[-1]
        e1 = Ja - JaPred

        JbPred = np.matmul(backwardProj, Wa)
        JbPred /= JbPred[-1]
        e2 = Jb - JbPred

        errorA[i,:] = e1
        errorB[i,:] = e2

    residual = np.vstack((errorA,errorB))
    return residual.flatten()


def genEulerZXZ(psi, theta, sigma):
    c1 = cos(psi)
    s1 = sin(psi)
    c2 = cos(theta)
    s2 = sin(theta)
    c3 = cos(sigma)
    s3 = sin(sigma)

    mat = np.zeros((3,3))

    mat[0,0] = (c1 * c3) - (s1 * c2 * s3)
    mat[0,1] = (-c1 * s3) - (s1 * c2 * c3)
    mat[0,2] = (s1 * s2)

    mat[1,0] = (s1 * c3) + (c1 * c2 * s3)
    mat[1,1] = (-s1 * s3) + (c1 * c2 * c3)
    mat[1,2] = (-c1 * s2)

    mat[2,0] = (s2 * s3)
    mat[2,1] = (s2 * c3)
    mat[2,2] = c2

    return mat

This optimization takes 0.2 to 0.4 seconds, which is too slow for my application. Do you know how to speed up this process? Or is there another way to find the relative rotation and translation between two point sets? For rpoleski, here is the profiling output:

       96    0.023    0.000   19.406    0.202 /usr/local/lib/python3.7/dist-packages/scipy/optimize/_lsq/least_squares.py:240(least_squares)
     4548    0.132    0.000   18.986    0.004 /usr/local/lib/python3.7/dist-packages/scipy/optimize/_lsq/least_squares.py:801(fun_wrapped)
       96    0.012    0.000   18.797    0.196 /usr/local/lib/python3.7/dist-packages/scipy/optimize/_lsq/least_squares.py:42(call_minpack)
     4548   11.102    0.002   18.789    0.004 /home/pi/helperFunctions.py:29(minimizeReprojection)

Solution

  • Most likely, the largest portion of the time inside scipy.optimize.least_squares() is spent computing the residuals, which in your case means running the code in minReproj().

    However, the code you provided in minReproj() has sub-optimal memory management, which can be greatly improved with pre-allocation. This results in significant speed gains.


    For example, vstack() and hstack() pay a significant penalty for copying their arguments into the memory of the final result. Consider the first few lines of minReproj(), which I regrouped in gen_affine_OP(). They can be rewritten as gen_affine() with much better timings (I have also rewritten gen_euler_zxz() so that it does not allocate new memory):

    import numpy as np
    from math import sin, cos
    
    
    def gen_euler_zxz(result, psi, theta, sigma):
        c1 = cos(psi)
        s1 = sin(psi)
        c2 = cos(theta)
        s2 = sin(theta)
        c3 = cos(sigma)
        s3 = sin(sigma)
        result[0,0] = (c1 * c3) - (s1 * c2 * s3)
        result[0,1] = (-c1 * s3) - (s1 * c2 * c3)
        result[0,2] = (s1 * s2)
        result[1,0] = (s1 * c3) + (c1 * c2 * s3)
        result[1,1] = (-s1 * s3) + (c1 * c2 * c3)
        result[1,2] = (-c1 * s2)
        result[2,0] = (s2 * s3)
        result[2,1] = (s2 * c3)
        result[2,2] = c2
        return result
    
    
    def gen_affine(dof):
        result = np.zeros((4, 4))
        gen_euler_zxz(result[:3, :3], dof[0], dof[1], dof[2])
        result[:3, 3] = dof[3:]
        result[3, 3] = 1
        return result
    
    
    def gen_affine_OP(dof):
        Rmat = gen_euler_zxz(np.empty((3, 3)), dof[0], dof[1], dof[2])
        translationArray = np.array([[dof[3]], [dof[4]], [dof[5]]])
        temp = np.hstack((Rmat, translationArray))
        return np.vstack((temp, [0, 0, 0, 1]))
    
    
    dof = 1, 2, 3, 4, 5, 6
    %timeit gen_affine_OP(dof)
    # 100000 loops, best of 3: 16.6 µs per loop
    %timeit gen_affine(dof)
    # 100000 loops, best of 3: 5.02 µs per loop
    

    Similarly, the residual = np.vstack((errorA, errorB)) call can be avoided by pre-allocating a larger residual array and using views into it in place of errorA and errorB.
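
    A minimal sketch of this view trick (numPoints and the array names are placeholders mirroring the original code):

```python
import numpy as np

numPoints = 4
# Allocate the stacked result once...
residual = np.empty((2 * numPoints, 3))
# ...and let errorA / errorB be views into its two halves.
errorA = residual[:numPoints]
errorB = residual[numPoints:]

errorA[:] = 1.0  # writes land directly in residual
errorB[:] = 2.0
```

    Since errorA and errorB share memory with residual, filling them fills residual in place, and the np.vstack() call disappears entirely.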

    Another example is when creating Ja, Jb, Wa, Wb:

    def func_OP(numPoints):
        for i in range(numPoints):
            Ja = np.ones((3))
            Jb = np.ones((3))
            Wa = np.ones((4))
            Wb = np.ones((4))
    
    
    def func(n):
        Ja = np.empty(3)
        Jb = np.empty(3)
        Wa = np.empty(3)
        Wb = np.empty(3)
        for i in range(n):
            Ja[:] = 1
            Jb[:] = 1
            Wa[:] = 1
            Wb[:] = 1
    
    
    %timeit func_OP(1000)
    # 100 loops, best of 3: 9.31 ms per loop
    %timeit func(1000)
    # 100 loops, best of 3: 2.2 ms per loop
    

    Also, .flatten() makes a copy you do not need, while .ravel() just returns a view; that is enough for your purposes and is much faster:

    a = np.ones((300, 300))
    %timeit a.ravel()
    # 1000000 loops, best of 3: 215 ns per loop
    %timeit a.flatten()
    # 10000 loops, best of 3: 113 µs per loop
    

    The final remark concerns speeding up your main loop. I do not see a simple vectorized rewriting for it, but you can speed things up with Numba (as long as the code compiles in non-object mode).

    To do so, you also need to JIT-decorate gen_euler_zxz(), and you need to replace the np.matmul() calls with np.dot() (because np.matmul() is not supported by Numba).
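
    This substitution is safe here because, for the 2-D × 2-D and 2-D × 1-D products that minReproj() performs, np.dot() and np.matmul() give identical results (they only diverge for inputs with more than two dimensions). A quick sanity check with stand-in arrays of the same shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
w2c = rng.random((3, 4))   # stand-in for w2cMatrix
proj = rng.random((4, 4))  # stand-in for perspectiveProj
vec = rng.random(4)        # stand-in for a homogeneous point

# 2-D @ 2-D product, as in matmul(w2cMatrix, perspectiveProj)
assert np.allclose(np.dot(w2c, proj), np.matmul(w2c, proj))
# 2-D @ 1-D product, as in matmul(forwardProj, Wb)
assert np.allclose(np.dot(w2c, vec), np.matmul(w2c, vec))
```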


    Eventually, the revised minReproj() would look like:

    from math import sin, cos
    import numpy as np
    import numba as nb
    
    
    matmul = np.dot
    
    
    @nb.jit
    def gen_euler_zxz(result, psi, theta, sigma):
        c1 = cos(psi)
        s1 = sin(psi)
        c2 = cos(theta)
        s2 = sin(theta)
        c3 = cos(sigma)
        s3 = sin(sigma)
        result[0, 0] = (c1 * c3) - (s1 * c2 * s3)
        result[0, 1] = (-c1 * s3) - (s1 * c2 * c3)
        result[0, 2] = (s1 * s2)
        result[1, 0] = (s1 * c3) + (c1 * c2 * s3)
        result[1, 1] = (-s1 * s3) + (c1 * c2 * c3)
        result[1, 2] = (-c1 * s2)
        result[2, 0] = (s2 * s3)
        result[2, 1] = (s2 * c3)
        result[2, 2] = c2
        return result
    
    
    @nb.jit
    def minReproj(dof, d2dPoints1, d2dPoints2, d3dPoints1, d3dPoints2, w2cMatrix):
        perspectiveProj = np.zeros((4, 4))
        gen_euler_zxz(perspectiveProj[:3, :3], dof[0], dof[1], dof[2])
        perspectiveProj[:3, 3] = dof[3:]
        perspectiveProj[3, 3] = 1
    
        numPoints = d2dPoints1.shape[0]
        residual = np.empty((2 * numPoints, 3))
    
        forwardProj = matmul(w2cMatrix, perspectiveProj)
        backwardProj = matmul(w2cMatrix, np.linalg.inv(perspectiveProj))
    
        Ja = np.empty(3)
        Jb = np.empty(3)
        Wa = np.empty(4)
        Wb = np.empty(4)
        for i in range(numPoints):
            j = i + numPoints
    
            Ja[2] = Jb[2] = 1.0
            Wa[3] = Wb[3] = 1.0
            Ja[0:2] = d2dPoints1[i, :]
            Jb[0:2] = d2dPoints2[i, :]
            Wa[0:3] = d3dPoints1[i, :]
            Wb[0:3] = d3dPoints2[i, :]
    
            JaPred = matmul(forwardProj, Wb)
            JaPred /= JaPred[-1]
            residual[i, :] = Ja - JaPred
    
            JbPred = matmul(backwardProj, Wa)
            JbPred /= JbPred[-1]
            residual[j, :] = Jb - JbPred
        return residual.ravel()
    

    Testing on some toy data (with minReproj_OP being your original implementation, minReproj the pre-allocating rewrite without the @nb.jit decorators, and minReproj_nb the Numba-compiled version) shows significant speed-ups, which become dramatic when using Numba appropriately:

    n = 10000
    dof = 1, 2, 3, 4, 5, 6
    d2dPoints1 = np.random.random((n, 2))
    d2dPoints2 = np.random.random((n, 2))
    d3dPoints1 = np.random.random((n, 3))
    d3dPoints2 = np.random.random((n, 3))
    w2cMatrix = np.random.random((3, 4))
    x = minReproj_OP(dof, d2dPoints1, d2dPoints2, d3dPoints1, d3dPoints2, w2cMatrix)
    y = minReproj(dof, d2dPoints1, d2dPoints2, d3dPoints1, d3dPoints2, w2cMatrix)
    z = minReproj_nb(dof, d2dPoints1, d2dPoints2, d3dPoints1, d3dPoints2, w2cMatrix)
    print(np.allclose(x, y))
    # True
    print(np.allclose(x, z))
    # True
    
    
    %timeit minReproj_OP(dof, d2dPoints1, d2dPoints2, d3dPoints1, d3dPoints2, w2cMatrix)
    # 1 loop, best of 3: 277 ms per loop
    %timeit minReproj(dof, d2dPoints1, d2dPoints2, d3dPoints1, d3dPoints2, w2cMatrix)
    # 10 loops, best of 3: 177 ms per loop
    %timeit minReproj_nb(dof, d2dPoints1, d2dPoints2, d3dPoints1, d3dPoints2, w2cMatrix)
    # 100 loops, best of 3: 5.69 ms per loop
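
    As for the question's aside about other ways to find the relative rotation and translation of two point sets: if a plain rigid alignment between the two 3-D point sets is acceptable (i.e. without the forward/backward reprojection residuals), the Kabsch algorithm gives a closed-form answer via a single SVD, with no iterative optimizer at all. A minimal sketch (rigid_align() is a hypothetical helper, not part of the code above):

```python
import numpy as np


def rigid_align(p, q):
    """Return R, t minimizing sum ||R @ p_i + t - q_i||^2 (Kabsch algorithm)."""
    p_mean = p.mean(axis=0)
    q_mean = q.mean(axis=0)
    h = (p - p_mean).T @ (q - q_mean)    # 3x3 cross-covariance of centered sets
    u, _, vt = np.linalg.svd(h)
    # Flip the last axis if the unconstrained optimum would be a reflection.
    d = np.sign(np.linalg.det(vt.T @ u.T))
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    t = q_mean - r @ p_mean
    return r, t


# Recover a known transform from synthetic, noise-free correspondences.
rng = np.random.default_rng(1)
r_true, _ = np.linalg.qr(rng.standard_normal((3, 3)))
if np.linalg.det(r_true) < 0:
    r_true[:, 0] *= -1.0                 # ensure a proper rotation (det = +1)
t_true = np.array([1.0, -2.0, 0.5])
p = rng.standard_normal((50, 3))
q = p @ r_true.T + t_true                # q_i = R @ p_i + t

r_est, t_est = rigid_align(p, q)
```

    With noisy correspondences the same formula yields the least-squares rigid transform. It does not replace the reprojection-based cost above, but it is a cheap way to compute a good dSeed to start least_squares from instead of np.zeros(6).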