Search code examples
object-detectionobject-detection-api

MobileNet-SSD input resolution


I have a working object detection model (fined-tuned MobileNet SSD) that detects my custom small robot. I'll feed it some webcam footage (which will be tied to a drone) and use the real-time bounding box information.

So, I am about to purchase the camera.

My questions: since SSD resizes the input images into 300x300, is the camera resolution very important? Does higher resolution mean better accuracy (even when it gets resized to 300x300 anyway)? Should I crop the camera footage into 1:1 aspect ratio at every frame before running my object detection model on it? Should I divide the image into MxN cropped segments and run inference one by one?

Because my robot is very small and the drone will be at a 4 meter altitude, so I'll effectively be trying to detect a very tiny spot on my input image.

Any sort of wisdom is greatly appreciated, thank you.


Solution

  • These are quite a few questions, I'll try to answer all of them. The detection model resizes the input images before feeding it to the network by some resizing method, e.g. bilinear. It would be better of course if the input image would be equal or larger than the input size to the network rather than smaller. A rule of thumb is that indeed higher resolution means better accuracy, but it highly depends on the setup and the task. If you're trying to detect a small object, and let's say for example that the original resolution is 1920x1080. Then after resizing the image, the small object would be even smaller (pixels-wise), and might be too small to detect. Therefore, indeed, it would be better to either split the image to smaller images (possibly with some overlap to avoid misdetection due to object splitting) and applying detection on each, or using a model with higher input resolution. Be aware that while the first is possible with your current model, you'll need to train a new model possibly with some architectural changes (e.g. adding SSD layers and modifying anchors, depends on the scales you want to detect) for the latter. Regarding the aspect ratio matter, you mostly need to stay consistent. It doesn't matter if you don't keep the original aspect ratio, but if you don't - do it both in training and evaluation/test/deployment.