I want to align feature maps using ego motion, as described in the paper "An LSTM Approach to Temporal 3D Object Detection in LiDAR Point Clouds".
I use VoxelNet as the backbone, which downsamples the image by a factor of 8. My voxel size is 0.1 m x 0.1 m x 0.2 m (height).
So given an input bird's-eye-view image of size 1408 x 1024, the extracted feature map is 176 x 128, i.e. downsampled by 8.
The ego translation of the car between the "images" (actually point clouds) is 1 meter in both the x and y directions. Am I right that I should shift the feature map by 1.25 pixels?
1 m / 0.1 m = 10  # meters to input-image pixels
10 / 8 = 1.25     # divided by the network downsampling factor -> feature-map pixels
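As a sanity check, this is the arithmetic I mean (the variable names are just for illustration):

```python
voxel_size_xy = 0.1     # m per input BEV pixel
backbone_stride = 8     # VoxelNet downsampling factor

input_w, input_h = 1408, 1024
feat_w, feat_h = input_w // backbone_stride, input_h // backbone_stride  # 176, 128

ego_translation = 1.0   # m, in both x and y
shift_input_px = ego_translation / voxel_size_xy   # 10 input pixels
shift_feat_px = shift_input_px / backbone_stride   # 1.25 feature-map pixels
```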
However, through experiments I found that the feature maps align better if I shift them by only 1/32 of a pixel for the 1-meter real-world translation.
P.S. I am using the function torch.nn.functional.affine_grid to perform the translation; it takes a 2x3 affine matrix as input.
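For reference, this is roughly how I apply the shift (a minimal sketch of my usage; the names `shift_feature_map`, `tx`, and `ty` are just for illustration, and choosing the right values for `tx`/`ty` is exactly what my question is about):

```python
import torch
import torch.nn.functional as F

def shift_feature_map(feat, tx, ty):
    # feat: (N, C, H, W); tx, ty are the translation entries of the 2x3 matrix
    n = feat.size(0)
    theta = torch.tensor([[1.0, 0.0, tx],
                          [0.0, 1.0, ty]], dtype=feat.dtype, device=feat.device)
    theta = theta.unsqueeze(0).repeat(n, 1, 1)            # (N, 2, 3)
    grid = F.affine_grid(theta, feat.size(), align_corners=False)
    return F.grid_sample(feat, grid, align_corners=False)
```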
It was caused by the function torch.nn.functional.affine_grid I used. I didn't fully understand this function before using it. These vivid images are very helpful for showing what this function actually does (in comparison with the affine transformations in NumPy).
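If I understand it correctly now, the key difference from a NumPy-style affine transform is that affine_grid works in normalized coordinates: both image axes span [-1, 1], so the translation entries of the 2x3 matrix are fractions of half the image size, not pixels. A sketch of the conversion I should have done (assuming align_corners=False; the helper name is mine):

```python
import torch
import torch.nn.functional as F

def pixel_shift_to_theta(shift_x_px, shift_y_px, width, height):
    # affine_grid uses normalized coordinates where each axis spans [-1, 1],
    # so one pixel corresponds to 2/width (or 2/height) with align_corners=False.
    # The grid maps output positions to input sampling positions, so translating
    # the sampling grid by -shift moves the content by +shift.
    tx = -2.0 * shift_x_px / width
    ty = -2.0 * shift_y_px / height
    return torch.tensor([[1.0, 0.0, tx],
                         [0.0, 1.0, ty]])

# A 1.25-pixel shift on a 176 x 128 feature map is a translation of about
# 2 * 1.25 / 176 ≈ 0.014 in normalized units, not 1.25.
theta = pixel_shift_to_theta(1.25, 1.25, width=176, height=128)
feat = torch.randn(1, 64, 128, 176)                        # (N, C, H, W)
grid = F.affine_grid(theta.unsqueeze(0), feat.size(), align_corners=False)
shifted = F.grid_sample(feat, grid, align_corners=False)
```

So the correct translation is on the order of a few hundredths in normalized units; the roughly 1/32 value I found empirically is in that same ballpark, which would explain why it looked much better aligned than passing 1.25 directly.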