Tags: matlab, neural-network, deep-learning, object-detection

How important is the Input Size for Deep Learning Architectures?


Recently, I've been playing with MATLAB's R-CNN deep learning example here. In this example, MATLAB designs a basic 15-layer CNN with an input size of 32x32 and pre-trains it on the CIFAR-10 dataset, whose training images are also 32x32. They then use a small dataset of stop signs to fine-tune this CNN for stop sign detection. The dataset has only 41 images, so they fine-tune the CNN on those 41 images to train an R-CNN detector. This is how they detect a stop sign:

[Image: stop sign detection result]

As you can see, the bounding box covers almost the whole stop sign, except for a small part at the top. Playing with the code, I decided to fine-tune the same network pre-trained on CIFAR-10 with the PASCAL VOC dataset, but only for the "aeroplane" class. These are some of the results I get:

[Image: airplane detection, result 1]

[Image: airplane detection, result 2]

As you can see, the detected bounding boxes barely cover the whole airplane, so the precision ends up at 0 when I evaluate them later. I understand that in the original R-CNN paper mentioned in the MATLAB example, the input size is 227x227 and their CNN has 25 layers. Could this be why the detections are inaccurate? How does the input size of a CNN affect the end result?
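For reference, my fine-tuning step follows the same pattern as the MATLAB example. This is only a minimal sketch: the variable `airplaneGroundTruth`, the training options, and the test image file name are my own placeholders, not taken from the example.

```matlab
% Minimal sketch of the fine-tuning step, following MATLAB's R-CNN example.
% cifar10Net is the 15-layer CNN pre-trained on CIFAR-10;
% airplaneGroundTruth is a table of image file names and "aeroplane"
% bounding boxes built from the PASCAL VOC annotations (hypothetical name).
options = trainingOptions('sgdm', ...
    'MiniBatchSize', 128, ...
    'InitialLearnRate', 1e-3, ...
    'MaxEpochs', 10);

% Fine-tune the pre-trained CNN as an R-CNN object detector.
rcnn = trainRCNNObjectDetector(airplaneGroundTruth, cifar10Net, options, ...
    'NegativeOverlapRange', [0 0.3]);

% Detect airplanes in a test image.
img = imread('testAirplane.jpg');
[bboxes, scores] = detect(rcnn, img);
```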


Solution

  • Almost surely, yes! When you pass an image through a net, the net progressively condenses the data taken from the image until only the most relevant features remain; during this process, the spatial resolution of the input shrinks again and again. If, for example, you feed the net an image smaller than the expected input size, much of the image's information may be lost during the forward pass. In R-CNN, every region proposal is also warped to the net's fixed input size before classification, so with a 32x32 input the net "looks for" features at a very limited resolution. A big airplane that fills most of a high-resolution image contains far more detail than 32x32 pixels can preserve, which is one plausible reason for your loose bounding boxes; see the sketch below.
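To see how much detail a 32x32 input throws away for a large object, you can resize a region proposal down to the network's input size and compare it with the original crop. This is just an illustrative sketch; the file name and crop rectangle are made up:

```matlab
% Illustrate how a large region proposal is squeezed into the 32x32 input.
% File name and crop rectangle are placeholders for your own data.
img   = imread('bigAirplane.jpg');
crop  = imcrop(img, [50 80 600 400]);   % region proposal around the airplane
small = imresize(crop, [32 32]);        % what the 32x32 CNN actually sees

% Side-by-side comparison: most of the fine detail is gone at 32x32.
montage({crop, small});
```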