Why put the whole image in a tfrecord file? Why not just crop according to the bounding-box and put the cropped object in the tfrecord file?

Why do we put the whole image in a TFRecord file? Why not just crop the image to the bounding box and store only the cropped object in the TFRecord file? This should greatly reduce the size of that file.

Solution

Because in object detection you want the network to learn where the object is in the image, not just what it is. In image classification, you would crop the images as you propose, and the network would output "car" or "not car". In object detection, the network outputs a bounding box for each object along with its class (e.g. "car at x1, y1, x2, y2"). It learns this from the whole picture, with the ground-truth bounding boxes serving as the targets in the loss function.
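To make this concrete, here is a minimal sketch in plain Python (no TensorFlow) of what happens to the localization target if you crop first. The `Box` class, the helper functions, the 640x480 image size, and the two car boxes are all made up for illustration. The point is only that once each image is cropped to its box, every target becomes the full frame, so there is nothing left to learn about position.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box; the field names are an assumption for this sketch."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def normalize(box, width, height):
    """Scale pixel coordinates to [0, 1] relative to the given image size."""
    return Box(box.xmin / width, box.ymin / height,
               box.xmax / width, box.ymax / height)


def crop_to_box(box):
    """Return the crop's width and height, and the box re-expressed inside the crop."""
    w = box.xmax - box.xmin
    h = box.ymax - box.ymin
    return w, h, Box(0, 0, w, h)


# Two hypothetical 640x480 images, each with one car in a different place.
IMAGE_W, IMAGE_H = 640, 480
cars = [Box(32, 48, 192, 144), Box(320, 240, 640, 480)]

# Whole image stored: the regression targets differ from image to image.
full_targets = [normalize(b, IMAGE_W, IMAGE_H) for b in cars]

# Cropped image stored: the box always spans the entire crop.
crop_targets = []
for b in cars:
    w, h, inner = crop_to_box(b)
    crop_targets.append(normalize(inner, w, h))

print(full_targets)
print(crop_targets)
```

With the whole image, the two targets are different boxes, so the loss can penalize a wrong location. With crops, both targets collapse to the same full-frame box, which is fine for a classifier but gives a detector no localization signal.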