I'm currently trying to reconstruct camera poses using OpenCV. For the implementation, I'm roughly following the Epipolar Geometry Example. The idea is the following:
cv::xfeatures2d::SURF::detectAndCompute()
cv::DescriptorMatcher::knnMatch()
cv::findEssentialMat()
cv::recoverPose()
As input data, I'm using a sequence of pictures from the freely available New Tsukuba Stereo Dataset. This dataset is an artificially created series of stereo camera frames with known ground-truth poses. To test my implementation, I estimated the relative pose between the 1st and 60th left pictures and the relative pose between the 1st and 60th right pictures. The translation vectors of both pairs should point in roughly the same direction, since the pairs consist of corresponding pictures from the left and right camera. Looking at the ground truth, the rotation magnitude of both poses should be roughly 7 degrees. When I run my implementation, however, I unfortunately get the following poses:
First pair:
-> Rotation vector: [[0.3406, 0.9054, 0.2534]]
-> Rotation angle: 10.975deg
-> Translation vector: [[-0.8103, 0.04748, -0.5841]]
Second pair:
-> Rotation vector: [[0.7907, 0.5027, 0.3494]]
-> Rotation angle: 5.24811deg
-> Translation vector: [[0.748, 0.2306, -0.6223]]
My results are all over the place, and I'm not quite sure what's wrong. Neither rotation is close to 7 degrees, and the translation vectors point in completely different directions. I have created a minimal code example that demonstrates my process:
#include <cstdlib>
#include <filesystem>
#include <iostream>
#include <optional>
#include <vector>
#include <opencv2/opencv.hpp>
#include <opencv2/xfeatures2d.hpp>

namespace fs = std::filesystem;
const fs::path ROOT_DIR = "NewTsukubaStereoDataset";
const cv::Matx33d CAMERA_MAT(615, 0, 640, 0, 615, 480, 0, 0, 1);
constexpr double HESSIAN_THRESHOLD = 400;
constexpr double LOWE_THRESHOLD = 0.8;
constexpr double RANSAC_CONFIDENCE = 0.999;
constexpr double MAX_DIST_TO_EPIPOLAR = 1;
constexpr int MAX_RANSAC_ITERS = 1000;
constexpr std::size_t MIN_INLIERS = 100;
std::optional<cv::Affine3d> reconstructPose(const cv::Mat& firstPic, const cv::Mat& secondPic, const double scale) {
    // initialize data structures
    std::vector<cv::KeyPoint> firstKeyPoints, secondKeyPoints;
    cv::Mat firstDescriptors, secondDescriptors, inlierMask;
    cv::Matx33d essentialMat, rotation;
    cv::Vec3d translation;
    std::vector<std::vector<cv::DMatch>> knnFeatureMatches;
    std::vector<cv::Point2f> firstInlierPts, secondInlierPts;
    // initialize algorithms
    cv::Ptr<cv::xfeatures2d::SURF> detector = cv::xfeatures2d::SURF::create(HESSIAN_THRESHOLD);
    cv::Ptr<cv::DescriptorMatcher> matcher = cv::DescriptorMatcher::create(cv::DescriptorMatcher::FLANNBASED);
    // compute features
    detector->detectAndCompute(firstPic, cv::noArray(), firstKeyPoints, firstDescriptors);
    detector->detectAndCompute(secondPic, cv::noArray(), secondKeyPoints, secondDescriptors);
    // find matching features
    matcher->knnMatch(firstDescriptors, secondDescriptors, knnFeatureMatches, 2);
    // ratio test as per Lowe's paper (copied from the opencv example)
    for(std::size_t i = 0; i < knnFeatureMatches.size(); ++i) {
        if(knnFeatureMatches[i][0].distance < LOWE_THRESHOLD * knnFeatureMatches[i][1].distance) {
            const cv::DMatch& m = knnFeatureMatches[i][0];
            firstInlierPts.push_back(firstKeyPoints[m.queryIdx].pt);
            secondInlierPts.push_back(secondKeyPoints[m.trainIdx].pt);
        }
    }
    // require a minimum number of inliers for effective ransac execution
    if(firstInlierPts.size() < MIN_INLIERS) {
        std::cerr << "Not enough inliers for essential matrix estimation" << std::endl;
        return std::nullopt;
    }
    // estimate essential matrix
    essentialMat = cv::findEssentialMat(firstInlierPts, secondInlierPts, CAMERA_MAT, cv::RANSAC,
                                        RANSAC_CONFIDENCE, MAX_DIST_TO_EPIPOLAR, MAX_RANSAC_ITERS, inlierMask);
    // require minimum number of valid inliers as well as a valid essential matrix (see https://en.wikipedia.org/wiki/Essential_matrix#Properties)
    if(!isValidEssentialMatrix(essentialMat) || cv::sum(inlierMask)(0) < MIN_INLIERS) {
        std::cerr << "Invalid essential matrix" << std::endl;
        return std::nullopt;
    }
    // estimate pose from the essential matrix
    const std::size_t numPoints = cv::recoverPose(essentialMat, firstInlierPts, secondInlierPts, CAMERA_MAT, rotation, translation, inlierMask);
    // recoverPose returns a unit length translation that needs to be scaled accordingly
    translation *= scale;
    // require minimum number of valid inliers as well as a valid rotation matrix
    if (isValidRotationMatrix(rotation) && numPoints >= MIN_INLIERS) {
        displayDebugPicture(firstPic, secondPic, inlierMask, firstInlierPts, secondInlierPts);
        return cv::Affine3d(rotation, translation);
    } else {
        std::cerr << "Invalid estimated pose" << std::endl;
        return std::nullopt;
    }
}
int main(int argc, char* argv[]) {
    // loading the data
    const cv::Mat left0 = cv::imread(ROOT_DIR / "illumination" / "fluorescent" / "L_00001.png", cv::IMREAD_GRAYSCALE);
    const cv::Mat left1 = cv::imread(ROOT_DIR / "illumination" / "fluorescent" / "L_00060.png", cv::IMREAD_GRAYSCALE);
    const cv::Mat right0 = cv::imread(ROOT_DIR / "illumination" / "fluorescent" / "R_00001.png", cv::IMREAD_GRAYSCALE);
    const cv::Mat right1 = cv::imread(ROOT_DIR / "illumination" / "fluorescent" / "R_00060.png", cv::IMREAD_GRAYSCALE);
    // reconstruct first pose (rotation angle should be around 7deg)
    std::cout << "Left pair:" << std::endl;
    std::optional<cv::Affine3d> pose0 = reconstructPose(left0, left1, 1);
    if(pose0.has_value()) {
        printAffine3d(pose0.value()); // prints the pose like I mentioned it above
    }
    // reconstruct second pose (rotation angle should be around 7deg)
    std::cout << "Right pair:" << std::endl;
    std::optional<cv::Affine3d> pose1 = reconstructPose(right0, right1, 1);
    if(pose1.has_value()) {
        printAffine3d(pose1.value()); // prints the pose like I mentioned it above
    }
    return EXIT_SUCCESS;
}
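printAffine3d is a small helper that is omitted here. A minimal sketch of how the axis/angle output shown above can be obtained from the returned pose, assuming cv::Rodrigues (the actual helper may format things differently):
// Hypothetical sketch: convert the rotation part of the pose to axis/angle form and print it.
void printAffine3d(const cv::Affine3d& pose) {
    cv::Vec3d rvec;
    cv::Rodrigues(pose.rotation(), rvec); // rotation vector, magnitude = angle in radians
    const double angle = cv::norm(rvec);
    cv::Vec3d axis(0, 0, 0);
    if(angle > 0) {
        axis = rvec * (1.0 / angle); // unit rotation axis
    }
    std::cout << "-> Rotation vector: " << axis << std::endl;
    std::cout << "-> Rotation angle: " << angle * 180.0 / CV_PI << "deg" << std::endl;
    std::cout << "-> Translation vector: " << pose.translation() << std::endl;
}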
As you can see in the code above, I implemented a few debugging mechanisms to make sure the process works as intended. First of all, I made sure that there is a reasonable number of matching features to be able to estimate the essential matrix and the relative poses:
if(firstInlierPts.size() < MIN_INLIERS) {
    std::cerr << "Not enough inliers for essential matrix estimation" << std::endl;
    return std::nullopt;
}
Aside from checking the number of valid inlier matches, I also check the validity of the essential matrix by making sure that it has two equal singular values and a third one that is zero. I also validate the output of cv::recoverPose() to make sure that it is a proper rotation matrix:
constexpr double EPS = 1e-5;
bool isValidEssentialMatrix(const cv::Matx33d& mat) {
    cv::Vec3d singularVals;
    cv::SVD::compute(mat, singularVals, cv::SVD::NO_UV);
    int zeroInd = -1;
    for(int i = 0; i < 3; ++i) {
        if(std::abs(singularVals(i)) <= EPS) {
            zeroInd = i;
            break;
        }
    }
    if(zeroInd == 0) {
        return std::abs(singularVals(1) - singularVals(2)) <= EPS;
    } else if(zeroInd == 1) {
        return std::abs(singularVals(0) - singularVals(2)) <= EPS;
    } else if(zeroInd == 2) {
        return std::abs(singularVals(0) - singularVals(1)) <= EPS;
    } else {
        return false;
    }
}
bool isValidRotationMatrix(const cv::Matx33d& mat) {
    cv::Matx33d tmp;
    cv::absdiff((mat.t() * mat), cv::Matx33d::eye(), tmp);
    return cv::sum(tmp)(0) <= EPS && std::abs(cv::determinant(mat) - 1) <= EPS;
}
To make sure that the matching features make sense, I also displayed them in a side-by-side view and verified that they actually represent corresponding points in the two pictures.
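displayDebugPicture is likewise omitted here. A minimal sketch of an equivalent side-by-side visualization, assuming cv::hconcat and cv::line (the actual helper may differ):
// Hypothetical sketch: draw the RANSAC inlier correspondences side by side for visual inspection.
void displayDebugPicture(const cv::Mat& firstPic, const cv::Mat& secondPic, const cv::Mat& inlierMask,
                         const std::vector<cv::Point2f>& firstPts, const std::vector<cv::Point2f>& secondPts) {
    cv::Mat gray, canvas;
    cv::hconcat(firstPic, secondPic, gray);
    cv::cvtColor(gray, canvas, cv::COLOR_GRAY2BGR);
    const cv::Point2f offset(static_cast<float>(firstPic.cols), 0.0f);
    for(std::size_t i = 0; i < firstPts.size(); ++i) {
        if(inlierMask.at<uchar>(static_cast<int>(i)) == 0) {
            continue; // skip RANSAC outliers
        }
        cv::line(canvas, firstPts[i], secondPts[i] + offset, cv::Scalar(0, 255, 0), 1);
    }
    cv::imshow("matches", canvas);
    cv::waitKey(0);
}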
At this point, I'm running out of ideas for what else I could do to find out why my code is not working. Am I missing something here, or am I using OpenCV incorrectly? Why do I get nonsense poses? Is there a bug in OpenCV itself? (I'm using version 4.7.0.)
We know the following about the dataset:
The resolution of the images is 640x480 pixels, the baseline of the stereo camera is 10cm and the focal length of the camera is 615 pixels.
The camera matrix contains the focal length and the optical center.
The optical center is usually placed in the center of the image. The ideal values are cx = (width-1) / 2 and cy = (height-1) / 2.
Minus one half because pixel centers sit at integer coordinates: if you imagine an image that is 4 pixels wide, the center sits between the middle two pixels, at coordinate 1.5 = (4-1) / 2.
In the code, this camera matrix is given:
const cv::Matx33d CAMERA_MAT( // good for images sized 1280 x 960
    615, 0, 640,
    0, 615, 480,
    0, 0, 1);
This would imply an image sized roughly 1280 by 960.
For a 640 by 480 image, a more sensible matrix would instead be
const cv::Matx33d CAMERA_MAT( // good for images sized 640 x 480
    615, 0, (640 - 1) / 2.0,
    0, 615, (480 - 1) / 2.0,
    0, 0, 1);
A projection matrix with the optical center in the bottom right corner is highly unusual but not impossible. It would arise from cropping a 1280 x 960 sensor image down to the top left quarter. Such cropping (leaving left and top as is) would not move the optical center.
By the way, assuming f = 615, you can calculate the fields of view. These formulas derive from the projection matrix, with some simplification:
tan(theta/2) * f = width/2               => HFoV ~55.0°
tan(theta/2) * f = height/2              => VFoV ~42.6°
tan(theta/2) * f = hypot(width,height)/2 => DFoV ~66.1°
That is, without taking lens distortion into account.
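As a quick sketch of that calculation (plain pinhole model, no distortion):
#include <cmath>
#include <cstdio>

int main() {
    const double f = 615.0, width = 640.0, height = 480.0;
    const double pi = std::acos(-1.0);
    // pinhole model: tan(theta/2) * f = extent/2, solved for theta
    const double hfov = 2.0 * std::atan2(width / 2.0, f) * 180.0 / pi;                     // ~55.0 deg
    const double vfov = 2.0 * std::atan2(height / 2.0, f) * 180.0 / pi;                    // ~42.6 deg
    const double dfov = 2.0 * std::atan2(std::hypot(width, height) / 2.0, f) * 180.0 / pi; // ~66.1 deg
    std::printf("HFoV %.1f deg, VFoV %.1f deg, DFoV %.1f deg\n", hfov, vfov, dfov);
    return 0;
}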
Since this is a stereo pair, you might plan to calculate a disparity map. Disparity and distance (along Z, not euclidean) are related by the "baseline", a kind of inter-pupillary distance (IPD).
We have a triangle spanned by the eyes and the 3D point, where you can assume (w.l.o.g.) one eye looking directly at it, and the other "missing" the point by some distance in pixels.
The equations on that triangle:
tan(alpha) * f [px] = disparity [px]
tan(alpha) * distance [m] = baseline [m]
Combined:
distance [m] * disparity [px] = baseline [m] * f [px]
Rearrange to taste.
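A small sketch with this dataset's numbers (baseline 0.10 m, f = 615 px):
// distance [m] * disparity [px] = baseline [m] * f [px], rearranged for distance
double distanceFromDisparity(double disparityPx) {
    const double baseline = 0.10; // m, from the dataset description
    const double f = 615.0;       // px
    return baseline * f / disparityPx; // e.g. a disparity of 10 px corresponds to 6.15 m
}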
Given all that, we can also calculate the displacement in X/Y of some point, given its distance (along Z, not euclidean) and screen coordinate (relative to optical center).
X[m] / Z[m] * f + cx = x[px]
X[m] = (x[px] - cx) / f * Z[m]
So if there's a point one meter away and its screen coordinate is x = 300, then x - cx = 300 - 319.5 = -19.5, and X[m] = -0.0317 [m]. If it's ten meters away, that would be X[m] = -0.317 [m].
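A small sketch of that back-projection, with cx and cy taken from the corrected camera matrix above:
#include <opencv2/core.hpp>

// back-project a pixel coordinate to metric X/Y at a known depth Z (along the optical axis)
cv::Point3d backProject(double xPx, double yPx, double zMeters) {
    const double f = 615.0, cx = 319.5, cy = 239.5;
    return { (xPx - cx) / f * zMeters, (yPx - cy) / f * zMeters, zMeters };
}
// backProject(300, 240, 1.0).x  is about -0.0317 m
// backProject(300, 240, 10.0).x is about -0.317 m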