Reconstructing 3D locations of corresponding points

I'm working on a project where I would like to reconstruct the 3D locations of feature points I've extracted from my camera images. The idea is to:

Make a camera recording (Greyscale information, VGA size: 640 x 480)
Extract feature points in the camera frames (I'm using SIFT for this)
Correspond features from frame[k-1] with features from frame[k] (I intend to use RANSAC for this, more on that later...)
Calculate/estimate some relative distance information between these feature points (this would be in some (x,y,z) coordinate system)

I've read in many papers that RANSAC is an algorithm that is used in reconstruction, with the end result being some kind of point cloud. I want to be able to do just that. However, I've ran into a few snags, and I hope you guys can help me out with these.

The first snag is that I do not really understand how I would be able to use RANSAC to perform this point correspondence. I understand the concept of RANSAC being a model-fitting tool, I just don't see how it could be used for doing correspondence solving.

The second snag is, assuming I have my correspondence information, how to get some kind of distance information between all these points. I've read that perspective projection could be used to solve this, and in turn one should try to estimate the Fundamental Matrix. Then do some math magic to be able to get the point cloud. Point is, I don't understand what the actual values in a Fundamental Matrix mean. I know it's gives a mathematical relation between the position of 2 cameras (or in my case, 2 frames in a video there the camera is moving), and that it exploits epipolar geometry. But besides this, I just don't have a clue what the Fundamental Matrix actually entails. How is this 3x3 matrix capturing the 6DOF of 1 camera with respect to another? Also I think the 'math magic' I referred to are some kind of matrix multiplications, but I haven't found any informational source to explain me what it does and what the formulation is.

Therefore, my question is: Could anyone of you point me into the right direction? I've been digging through the references of the papers I've read so far, but these also give me the "we solve this using the RANSAC algorithm"-line and I'm getting more the feeling I'm looking in the wrong direction. Is there some nice explanation of these things, perhaps in laymen's terms and/or with some illustrations? In short: where should I be looking or where can I find this elusive piece of information?

Thanks in advance, Xilconic

PS: Checked wikipedia, but it's not helping me much. Also listened to the 'Fundamental Matrix Song', and it's the same story.

Solution

Wrote my thesis on this, also using the RANSAC algorithm in my paper.

There is more to this topic than can be captured in a few paragraphs here. Consider getting the excellent book Multiple View Geometry.

Snag 1

RANSAC will find a model, in this case the fundamental matrix F, even in the presence of huge amount of outliers. In this case, some point-correspondance candidates are way off. This is an outlier. Basically you just keep fitting the F matrix from randomly drawn points. Eventually you find some set of points that together creates a consistent model. These are the inliers. They can now be used to estimate the model (F) more accurately.

There is an easy example in my paper with a line-fitting example to get you started and a easy-to-grasp explanation of RANSAC applied to the correspondance problem.

Snag 2

The most important thing about the F matrix is that it maps a point in one image to a line in the other:

Fx = l' where x is a point in one image and l' is a line in the other.

The F matrix has 9 elements but must have rank 2 and also the scale does not matter, thus it has only 7 degrees of freedom. There is no easy explanation for the elements of the F matrix.

Using a point correspondance x <-> x' and F the world 3d coordinate, X, of the depicted point can be extracted if you know the cameras internal parameters, like focal length.

Note that when using consecutive movie frames the camera usually moves very little and it might be hard to compute the fundamental matrix. It can be worked around though. I suggest looking into the works of Marc Pollefeys'