Estimate the camera pose in the reference system using one marker with ARUCO

I am currently working on a camera pose estimation project using only one marker with ARUCO.

I used ArUco's marker detector to detect the marker and get its rvec and tvec. I understand these two vectors represent the transform from the marker to the camera, i.e. the marker's pose w.r.t. the camera. I form a 4 by 4 matrix called T_marker_camera from these two vectors.

Then, I set up a world frame (left handed) and get the marker's world pose, which is a 4 by 4 transform matrix.

I want to calculate the pose of the camera w.r.t the world frame, and I use the following formula to calculate it: T_camera_world = T_marker_world * T_marker_camera_inv

Before applying the formula above, I convert the OpenCV coordinates to the left-handed convention (by flipping the sign of the x axis).

However, I didn't get the correct x, y, z of the camera w.r.t the world frame.

What am I missing to get the correct answer?

Thanks

Solution

The one equation you gave looks right, so the issue is probably somewhere that you didn't show/describe.

Fixing your notation will help make things clearer.

Write the pose/source frame on the right (input) and the reference/destination frame on the left (output). Then your matrices "match up" like dominoes: the right-hand frame of one matrix is the left-hand frame of the next.

rvec and tvec yield a matrix that should be called T_cam_marker, because it maps points from marker coordinates into camera coordinates.
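As a rough sketch of how that matrix can be assembled: the rotation vector is converted to a 3x3 rotation matrix and placed next to the translation in a 4x4 homogeneous transform. In OpenCV the conversion is done by cv2.Rodrigues; the version below reimplements it with NumPy only so the snippet runs on its own. The function names and the rvec/tvec values are made up for illustration.

```python
import numpy as np

def rodrigues(rvec):
    """Axis-angle rotation vector -> 3x3 rotation matrix (Rodrigues' formula)."""
    rvec = np.asarray(rvec, dtype=float).reshape(3)
    theta = np.linalg.norm(rvec)
    if theta < 1e-12:
        return np.eye(3)  # no rotation
    k = rvec / theta  # unit rotation axis
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])  # cross-product matrix of k
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def make_T_cam_marker(rvec, tvec):
    """Build the 4x4 transform that maps marker coordinates to camera coordinates."""
    T = np.eye(4)
    T[:3, :3] = rodrigues(rvec)
    T[:3, 3] = np.asarray(tvec, dtype=float).reshape(3)
    return T

# Illustrative values only, not taken from a real detection.
rvec = [0.0, 0.0, np.pi / 2]  # 90 degrees about the Z axis
tvec = [0.1, 0.0, 0.5]
T_cam_marker = make_T_cam_marker(rvec, tvec)

# A point given in marker coordinates, expressed in camera coordinates.
p_marker = np.array([1.0, 0.0, 0.0, 1.0])
p_cam = T_cam_marker @ p_marker
```

Reading the name right to left (marker in, cam out) matches what the matrix does to a point.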

If you want the pose of your camera in the world frame, that is

T_world_cam = T_world_marker * T_marker_cam
T_world_cam = T_world_marker * inv(T_cam_marker)

(equivalent to what you wrote, but in domino notation)

Be sure that you do matrix multiplication, not element-wise multiplication.
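A minimal NumPy sketch of that composition, assuming both poses are 4x4 rigid transforms. The marker and camera poses here are invented, and invert_rigid is just a helper name for this example. Note that on NumPy arrays `@` is matrix multiplication, while `*` is element-wise.

```python
import numpy as np

def invert_rigid(T):
    """Invert a 4x4 rigid transform [R t; 0 1] as [R^T, -R^T t; 0 1]."""
    R = T[:3, :3]
    t = T[:3, 3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T
    T_inv[:3, 3] = -R.T @ t
    return T_inv

# Hypothetical marker pose in the world: axes aligned with the world, at (1, 2, 0).
T_world_marker = np.eye(4)
T_world_marker[:3, 3] = [1.0, 2.0, 0.0]

# Hypothetical detection: marker 0.5 units along the camera's Z axis, no rotation.
T_cam_marker = np.eye(4)
T_cam_marker[:3, 3] = [0.0, 0.0, 0.5]

# Domino chain: world <- marker <- cam.
T_marker_cam = invert_rigid(T_cam_marker)
T_world_cam = T_world_marker @ T_marker_cam  # matrix product, not T_world_marker * T_marker_cam

camera_position_world = T_world_cam[:3, 3]
```

The last column of T_world_cam is the camera position in the world frame; the upper-left 3x3 block is its orientation.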

To move between left-handed and right-handed coordinate systems, insert a matrix that maps coordinates accordingly. Frames:

OpenCV camera/screen: right-handed, {X right, Y down, Z far}
ARUCO (in OpenCV anyway): right-handed, {X right, Y far, Z up}, first corner is top left (-X+Y quadrant)
whatever left-handed frame you have: let's say {X right, Y up, Z far}, for a screen or something similar

The handedness-change matrix for typical frames on screens is an identity matrix with the entry for Y set to -1. I don't know why you would flip X instead, but that's "equivalent" up to a rotation (the two flips differ by a 180° rotation about Z).
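To make the Y-flip concrete, here is a small sketch that assumes the left-handed frame is the example {X right, Y up, Z far} and the right-handed one is the OpenCV camera frame {X right, Y down, Z far}. The name T_left_right and the sample point are illustrative only; your own frames may need a different matrix.

```python
import numpy as np

# Maps right-handed OpenCV camera coordinates {X right, Y down, Z far}
# to the assumed left-handed frame {X right, Y up, Z far}: only Y changes sign.
T_left_right = np.diag([1.0, -1.0, 1.0, 1.0])

# A point 0.3 units below the optical axis in OpenCV coordinates (Y points down).
p_right = np.array([0.2, 0.3, 1.0, 1.0])
p_left = T_left_right @ p_right  # same point, now with Y pointing up

# A determinant of -1 on the 3x3 part confirms that handedness changes.
det = np.linalg.det(T_left_right[:3, :3])

# The flip is its own inverse, so the same matrix also maps left back to right.
T_right_left = T_left_right
```

In domino notation this matrix is one more link in the chain, placed next to whichever frame it converts.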