python linear-algebra projection-matrix opencv-solvepnp

Where do the return values of cv2.solvePnP actually hold the camera's position?

There's a cube and a camera in the real world. The coordinates of each point are shown in the following figure. The camera is clearly located at [0,0,1].

We can calculate where each point is displayed on screen.

import numpy as np
import cv2
import math

world = np.array(\
[\
( 5.00,  0.00,  0.00), \
( 5.00,  0.00,  0.50), \
( 6.00, -1.00,  0.00), \
( 6.00, -1.00,  0.50), \
( 6.00,  1.00,  0.00), \
( 6.00,  1.00,  0.50), \
( 7.00,  0.00,  0.50), \
])

# Camera Extrinsic Parameter
xRadEuler_C2W = -120 / 180 * math.pi
yRadEuler_C2W = 0 / 180 * math.pi
zRadEuler_C2W = -90 / 180 * math.pi

Rx = np.matrix([[1, 0, 0], [0, math.cos(xRadEuler_C2W), -   math.sin(xRadEuler_C2W)], [0, math.sin(xRadEuler_C2W),  math.cos(xRadEuler_C2W)]])
Ry = np.matrix([[ math.cos(yRadEuler_C2W), 0, math.sin(yRadEuler_C2W)], [0, 1, 0], [-math.sin(yRadEuler_C2W), 0, math.cos(yRadEuler_C2W)]])
Rz = np.matrix([[ math.cos(zRadEuler_C2W), -math.sin(zRadEuler_C2W), 0], [ math.sin(zRadEuler_C2W),  math.cos(zRadEuler_C2W), 0], [0, 0, 1]])

# Notice : Rotation Matrix from Euler Angle.
R = Rx * Ry * Rz

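# tvec is expressed w.r.t. the camera coordinate frame.
tvec = R * np.matrix((0, 0, -1)).T

# projectPoints expects a rotation vector, so convert R with Rodrigues.
rvec, _ = cv2.Rodrigues(np.asarray(R))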

# Camera Intrinsic Parameters
dist_coeffs = np.zeros((5, 1))
width = 640
height = 480
focal_length = 160
center = (width / 2, height / 2)

camera_matrix = np.array([[focal_length, 0, center[0]], 
                          [0, focal_length, center[1]],
                          [0, 0, 1]], dtype = "double")

if __name__ == "__main__":
  print("\nProject Point on Screen")
  result = cv2.projectPoints(world, rvec, tvec, camera_matrix, None)

  for n in range(len(world)):
    print(world[n], '==>', result[0][n])

As a result, we get

Project Point on Screen
[ 5.   0.  0.   ] ==> [[ 320.                  294.12609945]]
[ 5.   0.  0.5 ] ==> [[ 320.                  312.2071607]]
[ 6.  -1.  0.   ] ==> [[ 291.91086401  299.94150262]]
[ 6.  -1.  0.5 ] ==> [[ 290.62146125  315.41433581]]
[ 6.   1.  0.   ] ==> [[ 348.08913599  299.94150262]]
[ 6.   1.  0.5 ] ==> [[ 349.37853875  315.41433581]]
[ 7.   0.  0.5 ] ==> [[ 320.                  317.74146755]]

Now, I would like to calculate the camera's position, which we defined to be at [0,0,1].

import numpy as np
import cv2
import math

world = np.array(\
[\
( 5.00,  0.00,  0.00), \
( 5.00,  0.00,  0.50), \
( 6.00, -1.00,  0.00), \
( 6.00, -1.00,  0.50), \
( 6.00,  1.00,  0.00), \
( 6.00,  1.00,  0.50), \
( 7.00,  0.00,  0.50), \
])

img_pnts = np.array(\
[\
(320.                , 294.12609945), \
(320.                , 312.2071607), \
(291.91086401, 299.94150262), \
(290.62146125, 315.41433581), \
(348.08913599, 299.94150262), \
(349.37853875, 315.41433581), \
(320.                , 317.74146755), \
])

# Camera Intrinsic Parameters
dist_coeffs = np.zeros((5, 1))
width = 640
height = 480
focal_length = 160
center = (width / 2, height / 2)

camera_matrix = np.array(
                    [[focal_length, 0, center[0]], 
                    [0, focal_length, center[1]],
                    [0, 0, 1]], dtype = "double"
                    )

if __name__ == "__main__":
  (success, rot_vec, trans_vec) = cv2.solvePnP(world, img_pnts, camera_matrix, dist_coeffs, flags=cv2.SOLVEPNP_ITERATIVE)

  print("\nTranslation Vector")
  print(trans_vec)

  print("\nRotation Vector")
  print(rot_vec)

  print("\nRotation Matrix")
  R, jacob = cv2.Rodrigues(rot_vec)
  print(R)

The result looks something like this.

Translation Vector
[[ -8.87481507e-11]
 [ -8.66025403e-01]
 [  4.99999999e-01]]

Rotation Vector
[[-1.58351453]
 [-1.58351453]
 [-0.91424254]]

Rotation Matrix
[[  1.53020929e-11   1.00000000e+00  -5.93469718e-13]
 [  5.00000000e-01  -7.13717974e-12   8.66025404e-01]
 [  8.66025404e-01  -1.35487732e-11  -5.00000000e-01]]

Where did [0,0,1] go?

Disclaimer: I borrowed the figures and code from this article.

Solution

The returned pose (translation and rotation) is not the camera pose with respect to the world frame. It is the reverse, i.e. the pose of the world origin with respect to the camera frame.

Given a vector $X_{cam}$ expressed in camera frame coordinates and the corresponding vector $X_{world}$ in world coordinates, you have:

$$X_{cam} = R*X_{world} + tvec$$
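
As a quick sanity check of this relation, the sketch below plugs the (rounded) R and tvec printed in the question into an ideal pinhole projection, with no lens distortion, and projects the first cube corner. The helper name `project` is only for illustration; it is not an OpenCV function.

```python
import numpy as np

# Rotation matrix and translation vector as printed in the question (rounded).
R = np.array([[0.0,         1.0,  0.0],
              [0.5,         0.0,  0.866025404],
              [0.866025404, 0.0, -0.5]])
tvec = np.array([0.0, -0.866025403, 0.5])

# Intrinsics from the question: focal length 160, principal point (320, 240).
K = np.array([[160.0,   0.0, 320.0],
              [  0.0, 160.0, 240.0],
              [  0.0,   0.0,   1.0]])

def project(x_world):
    """Map a world point to pixel coordinates with an ideal pinhole model."""
    x_cam = R @ x_world + tvec   # world frame -> camera frame
    uvw = K @ x_cam              # camera frame -> homogeneous pixel coordinates
    return uvw[:2] / uvw[2]      # perspective divide

pixel = project(np.array([5.0, 0.0, 0.0]))
print(pixel)
```

The printed value should be close to (320, 294.126), the first row of the projectPoints output in the question, which suggests that R and tvec really do map world coordinates into the camera frame.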

Since the camera position is $(0,0,0)$ in the camera frame and $C$ in the world frame, you have to solve $0 = R*C + tvec$, i.e. $C = -R^{-1}*tvec$, which is the camera center.

In other words, to get the position of the camera with respect to the world frame, negate the translation vector tvec and then rotate it by the inverse (equivalently, the transpose) of the rotation matrix.
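
A minimal sketch of that computation, using the rounded numbers printed in the question. `camera_position` is a hypothetical helper name, not part of OpenCV; in a real script R would come from `cv2.Rodrigues(rot_vec)[0]` and tvec from `cv2.solvePnP`.

```python
import numpy as np

def camera_position(R, tvec):
    """Return the camera center in world coordinates, C = -R^T * tvec."""
    R = np.asarray(R, dtype=float)
    tvec = np.asarray(tvec, dtype=float).reshape(3, 1)
    # For a rotation matrix the inverse equals the transpose.
    return -R.T @ tvec

# Rotation matrix and translation vector as printed in the question (rounded).
R = np.array([[0.0,         1.0,  0.0],
              [0.5,         0.0,  0.866025404],
              [0.866025404, 0.0, -0.5]])
tvec = np.array([[0.0], [-0.866025403], [0.5]])

C = camera_position(R, tvec)
print(C.ravel())
```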

You should obtain the following position, which is approximately [0,0,1]:

-(R.T)@tvec = array([[3.66025377e-10], [8.93415583e-11], [9.99999999e-01]])
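
The same inverse also gives the camera's orientation in the world frame, if you need it. The sketch below reads the viewing direction off the transposed rotation matrix, again using the rounded values from the question. It assumes OpenCV's convention that the camera z-axis points out of the lens, and that the world z-axis points up, as the question's setup suggests.

```python
import numpy as np

# Rotation matrix as printed in the question (rounded); maps world -> camera.
R = np.array([[0.0,         1.0,  0.0],
              [0.5,         0.0,  0.866025404],
              [0.866025404, 0.0, -0.5]])

# Columns of R.T are the camera's x, y and z axes expressed in world coordinates.
cam_axes_world = R.T
forward = cam_axes_world[:, 2]   # camera z-axis (viewing direction) in world frame

# Downward tilt of the viewing direction relative to the world x-y plane
# (assumption: world z is "up").
tilt_deg = np.degrees(np.arcsin(-forward[2]))
print(forward, tilt_deg)
```

With these numbers the camera looks along the world x-axis, tilted about 30 degrees downward, which is consistent with a camera at [0,0,1] looking at a cube that sits around x = 5 to 7 near the ground.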