Accuracy in depth estimation - Stereo Vision

I am doing a research in stereo vision and I am interested in accuracy of depth estimation in this question. It depends of several factors like:

Proper stereo calibration (rotation, translation and distortion extraction),
image resolution,
camera and lens quality (the less distortion, proper color capturing),
matching features between two images.

Let's say we have a no low-cost cameras and lenses (no cheap webcams etc).

My question is, what is the accuracy of depth estimation we can achieve in this field? Anyone knows a real stereo vision system that works with some accuracy? Can we achieve 1 mm depth estimation accuracy?

My question also aims in systems implemented in opencv. What accuracy did you manage to achieve?

Solution

I would add that using color is a bad idea even with expensive cameras - just use the gradient of gray intensity. Some producers of high-end stereo cameras (for example Point Grey) used to rely on color and then switched to grey. Also consider a bias and a variance as two components of a stereo matching error. This is important since using a correlation stereo, for example, with a large correlation window would average depth (i.e. model the world as a bunch of fronto-parallel patches) and reduce the bias while increasing the variance and vice versa. So there is always a trade-off.

More than the factors you mentioned above, the accuracy of your stereo will depend on the specifics of the algorithm. It is up to an algorithm to validate depth (important step after stereo estimation) and gracefully patch the holes in textureless areas. For example, consider back-and-forth validation (matching R to L should produce the same candidates as matching L to R), blob noise removal (non Gaussian noise typical for stereo matching removed with connected component algorithm), texture validation (invalidate depth in areas with weak texture), uniqueness validation (having a uni-modal matching score without second and third strong candidates. This is typically a short cut to back-and-forth validation), etc. The accuracy will also depend on sensor noise and sensor's dynamic range.

Finally you have to ask your question about accuracy as a function of depth since d=f*B/z, where B is a baseline between cameras, f is focal length in pixels and z is the distance along optical axis. Thus there is a strong dependence of accuracy on the baseline and distance.

Kinect will provide 1mm accuracy (bias) with quite large variance up to 1m or so. Then it sharply goes down. Kinect would have a dead zone up to 50cm since there is no sufficient overlap of two cameras at a close distance. And yes - Kinect is a stereo camera where one of the cameras is simulated by an IR projector.

I am sure with probabilistic stereo such as Belief Propagation on Markov Random Fields one can achieve a higher accuracy. But those methods assume some strong priors about smoothness of object surfaces or particular surface orientation. See this for example, page 14.