opencvcomputer-visiontriangulationstructure-from-motion

Understanding cv2.recoverPose's coordinate frame transformations


I am having problems understanding the functionality of cv2.recoverPose(points1, points2). As of my understanding of the documentation, this function should return the rotation matrix from camera1 to camera2 as well as the translation from camera1 to camera2.

I am using an artificial set of points, points_3d, as well as the transformation between two cameras, T_1_0, to recover the camera transformation in Python. To get the points on the cameras, I calculate the projections of the 3D points onto the camera sensors:

import cv2
import numpy as np

def calc_projection(K_c, transform, pt_3d):
    p_projected = np.hstack((K_c, np.zeros((3,1)))) @ np.linalg.inv(np.vstack((transform, [0,0,0,1]))) @ np.vstack((pt_3d.reshape(3,1), 1))
    p_projected = p_projected[:2,:] / p_projected[2,:]
    p_projected = p_projected.ravel()
    return p_projected

points_3d = np.random.rand(100, 3)

K_c = np.eye(3)

T_0 = np.hstack((np.eye(3), np.zeros((3,1))))
rot_vec = np.array([0.2, 0.1, 0.3])
R_1_0, _ = cv2.Rodrigues(np.array(rot_vec))
t_0_10 = np.array([0.2, 0.4, 0.1])
T_1_0 = np.hstack((R_1_0, t_0_10.reshape(3,1)))
points1 = []
points2 = []

for pt_3d in points_3d:
    points1.append(calc_projection(K_c, T_0, pt_3d))
    points2.append(calc_projection(K_c, T_1_0, pt_3d))

points1 = np.array(points1)
points2 = np.array(points2)

E, mask = cv2.findEssentialMat(points1, points2, method=cv2.RANSAC)
inliers1 = points1[mask]
inliers2 = points2[mask]
_, R, t, _ = cv2.recoverPose(E, inliers1, inliers2)

r, _ = cv2.Rodrigues(R)
assert np.allclose(r, rot_vec)
assert np.allclose(t, t_0_10)

I would expect the result to be equal to T_1_0 (as of the assertion) but the result is:

r = [[0.20329041]
 [0.15711541]
 [0.37188371]]
t = [[0.50969714]
 [0.79593836]
 [0.32663581]]

What am I missing here? Why is it not working as expected? Am I doing something wrong or what is the expected behavior here?

Edit

The formula I've used for the projection wrongly introduces the inverse of the transformation. It should be the following instead:

p_projected = np.hstack((K_c, np.zeros((3,1)))) @ np.vstack((transform, [0,0,0,1])) @ np.vstack((pt_3d.reshape(3,1), 1))

Side Note

solvePnP also solves the problem that I've been trying to solve here (3D->2D projection)

_, r, t = cv2.solvePnP(points_3d, points2, K_c, None)
assert np.allclose(r, rot_vec.reshape(3, 1), rtol=0.15)
assert np.allclose(t, t_0_10.reshape(3, 1), rtol=0.15)

But I still don't know why cv2.recoverPose does not work? Looking at the documentation it should also return the translation and rotation...


Solution

  • TL;DR

    Most important findings:

    • The cv2.recoverPose(points1, points2,...) function returns the rotation from camera1 to camera2 if points1 are found on camera1 and points2 are found on camera2.

    • The returned translation vector is also from camera1 to camera2 but in the coordinate frame of camera1.

    • The translation vector is found only up to a factor which can't be recovered without further logic.

    • Setting the cameraMatrix on findEssentialMat is important.

    Description

    I am now able to use cv2.recoverPose and recover the right rotation and translation. Here is the updated code:

    E, mask = cv2.findEssentialMat(points1, points2, cameraMatrix=K_c)
    inliers1 = points1[mask]
    inliers2 = points2[mask]
    _, R, t, _ = cv2.recoverPose(E, inliers1, inliers2, cameraMatrix=K_c)
    
    r, _ = cv2.Rodrigues(R)
    

    The results are

    # As expected
    r = [[0.2]
     [0.1]
     [0.3]]
    
    # expected: [0.2, 0.4, 0.1], but is:
    t = [[0.43643578]
     [0.87287156]
     [0.21821789]]
    

    But! The documentation tells that the recovery of the translation is only possible up to a factor! So in this case, the following assertion works and the behavior is as expected!:

    factors = t.ravel() / t_0_10 
    assert np.allclose(factors, factors[0])
    

    Comment: What's funny is that if I use K_c = np.eye(3) it sometimes works and sometimes not. But if I use the intrinsics of a real camera, the assertion always is true...

    K_c = np.array([
        [485.0, 0.0, 320.0],
        [0.0, 485.0, 240.0],
        [0.0,0.0,1.0]
    ])