Tags: c++, opengl-es, glsl, point-clouds

GLSL truncated signed distance function (TSDF) implementation


I am looking to implement model reconstruction from RGB-D images, preferably on mobile phones. From what I have read, this is usually done with a TSDF representation. I have read a lot of papers on hierarchical structures and other ideas to speed this up, but my problem is that I still have no clue how to actually implement this representation.

If I have a volume grid of size n, so n x n x n, I want to store the signed distance, weight and color information in each voxel. My only guess is that I have to build a discrete set of points, one for each voxel position, then "paint" all these points with GLSL and calculate the nearest distance for each. But that does not seem good or efficient, since it has to be calculated n^3 times.

How should I go about implementing such a TSDF representation?

The problem is, my only idea is to render the voxel grid in order to store the signed-distance data. But then, for each depth map, I would have to render all the voxels again and recalculate all the distances. Is there any way to render it the other way around?

So can't I instead render the points of the depth map and store the information in the voxel grid?

What is the current state of the art for rendering such a signed distance representation efficiently?


Solution

  • You are on the right track; it's an ambitious project, but very cool if you can do it.

    First, it's worth getting a good idea of how these things work. The original paper describing the TSDF was by Curless and Levoy and is reasonably approachable - a copy is here. There are many later variations, but this is the starting point.

    Second, you will need to create n x n x n storage, as you have said. This very quickly gets big. For example, if you want 400 x 400 x 400 voxels with RGB data and floating point values for distance and weight, that's 12 bytes per voxel and therefore 768MB of GPU memory - you may want to check how much GPU memory you have available on a mobile device. Yup, I said GPU because...
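    As a rough back-of-the-envelope sketch (the struct layout and names here are just an assumption for illustration, not something prescribed by the TSDF itself), the memory footprint works out like this:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical per-voxel layout: truncated signed distance, integration
// weight, and packed RGBA colour. 4 + 4 + 4 = 12 bytes per voxel.
struct Voxel {
    float    tsdf;    // truncated signed distance, usually clamped to [-1, 1]
    float    weight;  // accumulated integration weight
    uint32_t rgba;    // packed colour
};

int main() {
    const std::size_t n     = 400;                     // voxels per side
    const std::size_t count = n * n * n;               // 64,000,000 voxels
    const std::size_t bytes = count * sizeof(Voxel);   // 12 bytes per voxel
    std::printf("%zu voxels, %.0f MB\n", count, bytes / 1e6);  // ~768 MB
    return 0;
}
```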

    While you can implement toy solutions on CPU, you really need to get serious about GPU programming if you want to have any kind of performance. I built an early version on an Intel i7 CPU laptop. Admittedly I spent no time optimising it but it was taking tens of seconds to integrate a single depth image. If you want to get real time (30Hz) then you'll need some GPU programming.

    Now that you have your TSDF data representation, each frame you need to do this:

    1. Work out where the camera is with respect to the TSDF in world coordinates. Usually you assume that the camera is at the origin at time t=0 and then measure its relative translation and rotation with respect to the previous frame. The most common way to do this is with an algorithm called iterative closest point (ICP). You could implement this yourself or use a library like PCL, though I'm not sure if they have a mobile version. I'd suggest you get started without this by just keeping your camera and scene stationary, and build up to movement later.

    2. Integrate the depth image you have into the TSDF. This means updating the TSDF with the next depth image. You don't throw away the information you have to date; you merge the new information with the old. You do this by iterating over every voxel in your TSDF and, for each one, doing the following (a rough sketch of this loop in C++ is given after the list):

    a) Work out the distance from the voxel centre to your camera

    b) Project the point into your depth camera's image plane to get a pixel coordinate (using the extrinsic camera position obtained above and the camera intrinsic parameters which are easily searchable for Kinect)

    c) Look up the depth in your depth map at that pixel coordinate

    d) Project this point back into space using the pixel x and y coords plus depth and your camera properties to get a 3D point corresponding to that depth

    e) Update the current voxel's distance with the value distance_from_step_d - distance_from_step_a (the update is usually a weighted average of the existing value and the new one).

    You can use a similar approach for the voxel colour.
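    A minimal, CPU-style C++ sketch of that per-voxel loop is below. All of the names, the pinhole camera model, the truncation constant and the weight cap are assumptions made for illustration, not a reference implementation; a real system would run this on the GPU and fold the colour update into the same pass.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };

struct Voxel { float tsdf; float weight; };

struct Intrinsics { float fx, fy, cx, cy; };  // pinhole model, e.g. Kinect-style

// World-to-camera rigid transform obtained from step 1 (ICP, or identity
// while the camera and scene are stationary).
struct Pose {
    float R[3][3];
    Vec3  t;
    Vec3 apply(const Vec3& p) const {
        return { R[0][0]*p.x + R[0][1]*p.y + R[0][2]*p.z + t.x,
                 R[1][0]*p.x + R[1][1]*p.y + R[1][2]*p.z + t.y,
                 R[2][0]*p.x + R[2][1]*p.y + R[2][2]*p.z + t.z };
    }
};

// Integrate one depth map into an n*n*n TSDF grid (steps a-e above).
void integrate(std::vector<Voxel>& grid, int n, float voxelSize,
               const std::vector<float>& depthMap, int width, int height,
               const Intrinsics& K, const Pose& worldToCamera,
               float truncation /* e.g. a few voxel sizes */) {
    for (int z = 0; z < n; ++z)
    for (int y = 0; y < n; ++y)
    for (int x = 0; x < n; ++x) {
        // Voxel centre in world coordinates (grid centred on the origin).
        Vec3 pw { (x - n / 2 + 0.5f) * voxelSize,
                  (y - n / 2 + 0.5f) * voxelSize,
                  (z - n / 2 + 0.5f) * voxelSize };
        Vec3 pc = worldToCamera.apply(pw);
        if (pc.z <= 0.0f) continue;                       // behind the camera

        // a) distance from the camera to the voxel centre
        float distVoxel = std::sqrt(pc.x*pc.x + pc.y*pc.y + pc.z*pc.z);

        // b) project the voxel centre into the depth image
        int u = static_cast<int>(K.fx * pc.x / pc.z + K.cx);
        int v = static_cast<int>(K.fy * pc.y / pc.z + K.cy);
        if (u < 0 || u >= width || v < 0 || v >= height) continue;

        // c) look up the measured depth at that pixel
        float d = depthMap[v * width + u];
        if (d <= 0.0f) continue;                          // invalid measurement

        // d) back-project the pixel to a 3D point and take its distance
        Vec3 ps { (u - K.cx) * d / K.fx, (v - K.cy) * d / K.fy, d };
        float distSurface = std::sqrt(ps.x*ps.x + ps.y*ps.y + ps.z*ps.z);

        // e) signed distance, truncated, merged as a running weighted average
        float sdf = distSurface - distVoxel;
        if (sdf < -truncation) continue;                  // far behind the surface
        float tsdf = std::min(1.0f, sdf / truncation);

        Voxel& vox = grid[(z * n + y) * n + x];
        vox.tsdf   = (vox.tsdf * vox.weight + tsdf) / (vox.weight + 1.0f);
        vox.weight = std::min(vox.weight + 1.0f, 64.0f); // cap the weight
    }
}
```

    In a GLSL or CUDA version, the body of this loop maps naturally onto one thread (or one invocation) per voxel, which is where the real-time performance comes from.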

    Once you have integrated all of your depth maps into the TSDF, you can visualise the result by either raytracing it or extracting the iso-surface as a 3D mesh (e.g. with marching cubes) and viewing it in another package.
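    For the raytracing route, a very simplified, self-contained C++ sketch (all names assumed; a real implementation would trilinearly interpolate samples and compute surface normals from the TSDF gradient) is to march each camera ray until the stored value changes sign:

```cpp
#include <vector>

struct Vec3 { float x, y, z; };

// Nearest-neighbour lookup into a flat n*n*n array of TSDF values,
// with the grid centred on the origin; out-of-bounds reads as "empty".
float sampleTsdf(const std::vector<float>& tsdf, int n, float voxelSize, Vec3 p) {
    int x = static_cast<int>(p.x / voxelSize + n / 2);
    int y = static_cast<int>(p.y / voxelSize + n / 2);
    int z = static_cast<int>(p.z / voxelSize + n / 2);
    if (x < 0 || y < 0 || z < 0 || x >= n || y >= n || z >= n) return 1.0f;
    return tsdf[(z * n + y) * n + x];
}

// Returns the depth of the first zero crossing along the ray, or -1 if none.
float raycast(const std::vector<float>& tsdf, int n, float voxelSize,
              Vec3 origin, Vec3 dir, float maxDepth) {
    float step = voxelSize * 0.5f;      // conservative fixed step size
    float prev = 1.0f;
    for (float t = 0.0f; t < maxDepth; t += step) {
        Vec3 p { origin.x + dir.x * t, origin.y + dir.y * t, origin.z + dir.z * t };
        float cur = sampleTsdf(tsdf, n, voxelSize, p);
        if (prev > 0.0f && cur < 0.0f)               // sign change: surface here
            return t - step * cur / (cur - prev);    // linear interpolation
        prev = cur;
    }
    return -1.0f;                                    // no surface along this ray
}
```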

    A really helpful paper that will get you there is here. It's a lab report by some students who actually implemented Kinect Fusion for themselves on a PC. It's pretty much a step-by-step guide, though you'll still have to learn CUDA or similar to implement it.

    You can also check out my source code on GitHub for ideas though all normal disclaimers about fitness for purpose apply.

    Good luck!