Tags: python, image, tensorflow, classification, detection

How do you do ROI-Pooling on Areas smaller than the target size?



I am currently trying to get the Faster R-CNN network from here to work in Windows with TensorFlow. For that, I wanted to re-implement the ROI pooling layer, since it does not work in Windows (at least not for me; if you have any tips on porting to Windows with TensorFlow, I would highly appreciate your comments!). According to this website, you take your proposed ROI from the feature map and max-pool its content to a fixed output size. This fixed output size is needed by the following fully connected layers, since they only accept inputs of a fixed size.
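For concreteness, here is a minimal NumPy sketch of that pooling step (my own illustration, not the reference code; the function name and the (y0, x0, y1, x1) box format are assumptions):

```python
import math
import numpy as np

def roi_max_pool(feature_map, roi, out_size=7):
    """Max-pool one ROI of a conv feature map to a fixed out_size x out_size grid."""
    # feature_map: (H, W, C) array, e.g. the conv5_3 output.
    # roi: (y0, x0, y1, x1) box in feature-map coordinates (assumed format).
    y0, x0, y1, x1 = roi
    region = feature_map[y0:y1, x0:x1, :]
    h, w, c = region.shape
    out = np.zeros((out_size, out_size, c), dtype=region.dtype)
    for i in range(out_size):
        # floor/ceil bin edges keep every bin non-empty, even when the
        # region is smaller than out_size.
        y_lo, y_hi = math.floor(i * h / out_size), math.ceil((i + 1) * h / out_size)
        for j in range(out_size):
            x_lo, x_hi = math.floor(j * w / out_size), math.ceil((j + 1) * w / out_size)
            out[i, j] = region[y_lo:y_hi, x_lo:x_hi].max(axis=(0, 1))
    return out
```

Note that with these floor/ceil bin edges the same code would also cover regions smaller than 7x7, because neighbouring bins then simply overlap.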


The problem now is the following:

After conv5_3, the last convolutional layer before ROI pooling, the box produced by the region proposal network is mostly around 5x5 pixels in size. This is expected, since the objects I want to detect usually have dimensions of about 80x80 pixels in the original image (the downsampling factor due to pooling is 16, and 80/16 = 5). However, I now have to max-pool an area of 5x5 pixels and ENLARGE it to 7x7, the target size of the ROI pooling. My first try, simply interpolating, did not work. Padding with zeros did not work either. I always seem to get the same scores for my classes.
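The two attempts were roughly along these lines (a simplified sketch with a made-up single-channel patch, not my actual code):

```python
import numpy as np
from scipy.ndimage import zoom

patch = np.random.rand(5, 5).astype(np.float32)  # stand-in for a 5x5 crop of conv5_3

# Attempt 1: bilinear interpolation from 5x5 up to 7x7.
enlarged = zoom(patch, 7 / 5, order=1)                     # shape (7, 7)

# Attempt 2: zero-padding the 5x5 patch out to 7x7.
padded = np.pad(patch, ((1, 1), (1, 1)), mode="constant")  # shape (7, 7)
```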

Is there anything I am doing wrong? I do not want to change the dimensions of any layer, and I know that my trained network works in general, because I have the reference implementation running on my dataset in Linux.

Thank you very much for your time and effort :)


Solution

  • There is now an official TF implementation of Faster R-CNN, and of other object detection algorithms, in their Object Detection API; you should probably check it out.

    If you still want to code it yourself, I wondered exactly the same thing as you and could not find an answer about how you're supposed to do it. My three guesses would be:

    • Interpolation; but it changes the feature values, so it destroys some information...

    • Resizing to 35x35 just by copying each cell 7 times, and then max-pooling back to 7x7. You do not have to actually do the resizing and then the pooling; in 1D it basically reduces to output[i] = max(input[floor(i*5/7)], input[ceil(i*5/7)]), with a similar max over 4 elements in 2D (be careful, I might have forgotten some +1/-1 somewhere; a 1D sketch appears at the end of this answer). I see at least two problems: some values are over-represented, being copied more often than others; and, worse, some (small) values will not appear in the output at all! You should avoid that, given that you can store more information in the output than in the input.

    • Making sure all input feature values are copied at least once into the output, each at the best possible place (basically, copy input[i] to output[j] with j = floor((i+1)*7/5) - 1). For the remaining spots, either leave a 0 or interpolate. I would think this solution is the best, maybe with interpolation, but I'm really not sure at all (also sketched at the end of this answer).

    It looks like smallcorgi's implementation uses my 2nd solution (without actually resizing, just using max pooling), since it's the same implementation as for the case where the input is bigger than the output.
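To make guesses 2 and 3 concrete, here is a 1D sketch of both (my own code, not smallcorgi's; the upper index of the max is clamped so that the +1/-1 caveat above is handled):

```python
import numpy as np

def pool_small_1d(inp, out_size=7):
    # Guess 2 in 1D: conceptually copy each input cell out_size times
    # (length n -> n*out_size), then max-pool back to out_size cells.
    # Each output cell then reduces to a max over at most two inputs.
    n = len(inp)
    out = np.empty(out_size, dtype=inp.dtype)
    for i in range(out_size):
        lo = i * n // out_size                 # floor(i * n / 7)
        hi = ((i + 1) * n - 1) // out_size     # last index, clamped in range
        out[i] = inp[lo:hi + 1].max()
    return out

def place_1d(inp, out_size=7):
    # Guess 3 in 1D: copy every input value exactly once into its
    # best-aligned output slot, leaving the remaining slots at zero
    # (interpolation could fill them instead).
    n = len(inp)
    out = np.zeros(out_size, dtype=inp.dtype)
    for i in range(n):
        out[(i + 1) * out_size // n - 1] = inp[i]  # j = floor((i+1)*7/5) - 1
    return out

x = np.array([5.0, 1.0, 4.0, 2.0, 6.0])
print(pool_small_1d(x))  # [5. 5. 4. 4. 4. 6. 6.] -- the small values 1 and 2
                         # are masked by larger neighbours, as noted above
print(place_1d(x))       # [5. 1. 0. 4. 2. 0. 6.] -- every value kept once,
                         # slots 2 and 5 left at zero
```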