Tags: tensorflow, object-detection, backpropagation, tensorflow-datasets, tfrecord

How does object detection training work during backpropagation?


I'm using TensorFlow to train a Faster R-CNN Inception V2 model.

Let's say I have 6000 images:

  • from image 1 to 3000: each image has both a dog and a cat, but I only labeled the dog.

  • from image 3001 to 6000: each image has both a dog and a cat, but I only labeled the cat.

So each image has a dog and a cat in it, but I only labeled the dog in half of them, and labeled the cat in the other half.

When creating the dataset I don't shuffle the images, so I'll have the first 3000 images labeled with dogs, then the other 3000 images labeled with cats.

My questions are:

  1. Does the order of the images affect the result? Would it change if I create a dataset with the dogs first and then the cats? Would it be different if I shuffle the data so the cats and dogs are mixed?

  2. When backpropagating, does the fact that I didn't label the cats where I labeled the dogs (and vice versa) affect the result? Does the model unlearn because I have unlabeled dogs and cats? Would I get the same result as having 3000 images with both the dog and the cat labeled in each image?

  3. The reason I don't label both the dog and the cat in each image is that I have images from a fixed camera where sometimes you see different dogs, or the same dog moving around, while a cat is sleeping. Labeling the sleeping cat every time would mean feeding in what is essentially the same image multiple times (and of course it takes a lot of time to label). How could I solve this? Should I crop the images before creating the dataset? Is it enough if I create an eval dataset where both the dog and the cat are labeled in each image, and a train dataset where only the object of interest (the dog) is labeled and not the cat?

Thanks


Solution

  • 1- Yes, the order of images affects the result [1], and more significantly it affects the speed at which your algorithm learns. In essence, your algorithm is trying to learn a configuration of weights that minimises your loss function over all the examples you show it. It does this by arranging those weights into a configuration that detects the features in the data which discriminate between cats and dogs. But it does this by considering only one batch of inputs at a time. Each image in a batch is considered individually, and back-prop decides how the weights should be altered so that the algorithm better detects the cat/dog in that image. It then averages all of these alterations across every image in the batch and makes that adjustment (see the sketch at the end of this point).

    If your batch contained all of your images then the order would not matter; it would make the adjustment that it expects will provide the greatest net reduction in your loss function across all the data. But if the batch contains less than all of the data (which it invariably does) then it makes an adjustment that helps detect the dogs/cats only in the images in that batch. This means that if you show it more cats than dogs, it will decide that a feature belonging equally to both cats and dogs actually increases the probability that the animal in question is a cat, which is false: it only appears that way because, in the instances where that feature was detected, a higher proportion happened to be cats. This will correct itself over time as the ratio of cats to dogs evens out, but the algorithm will arrive at its final configuration much more slowly, because it has to learn and then unlearn unhelpful features in the data.

    As an example, in your setup, by the time your algorithm has observed half of the data, all it has learned is that "all things that look like a cat or a dog are dogs". The features which discriminate between cats and dogs in the images have not been helpful for reducing your loss function. In fact it will have mis-learnt features common to both cats and dogs as being dog-specific, and will have to unlearn them later as it sees more data.

    In terms of the overall outcome: during the training process you are essentially traversing a high-dimensional optimisation space, following its gradient until the configuration of weights arrives at a local minimum from which the barrier to escape exceeds what your learning rate allows. Showing one class and then the other leads to a more meandering path towards the global minimum, and thus increases the likelihood of becoming stuck in a sub-optimal local minimum. [2]
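    A minimal, self-contained sketch of the two ideas above, assuming a toy classifier rather than the actual Faster R-CNN model: the data is shuffled so that every batch mixes both classes, and the gradient applied at each step is the average over the batch. All names, sizes and the toy data are stand-ins for illustration only.

```python
import tensorflow as tf

# Toy stand-in data mimicking the unshuffled ordering in the question:
# the first half is labelled 0 ("dog"), the second half 1 ("cat").
images = tf.random.uniform((6000, 32, 32, 3))
labels = tf.concat([tf.zeros(3000, tf.int32), tf.ones(3000, tf.int32)], axis=0)

dataset = (
    tf.data.Dataset.from_tensor_slices((images, labels))
    .shuffle(buffer_size=6000)   # buffer >= dataset size: a full shuffle, so batches mix classes
    .batch(24)
)

# A toy classifier standing in for the detector, purely for illustration.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(32, 32, 3)),
    tf.keras.layers.Dense(2),    # two logits: dog / cat
])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

@tf.function
def train_step(batch_images, batch_labels):
    with tf.GradientTape() as tape:
        logits = model(batch_images, training=True)
        # The loss (and hence the gradient) is averaged over the batch:
        # each image's "suggested alteration" is folded into a single update.
        loss = loss_fn(batch_labels, logits)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

for batch_images, batch_labels in dataset:
    train_step(batch_images, batch_labels)
```

    If you train through the Object Detection API rather than a hand-written loop, the equivalent fix is simply to let the input pipeline shuffle the records instead of writing them out in class order.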


    2- If all of the images in your data set contain a dog, you really want to label that dog in every image. This does three things:

    • Doubles the size of your data set (more data = better results).
    • Prevents falsely penalising the model for accurately detecting a dog in the images where you have not labelled the dog.
    • Prevents the algorithm from detecting unrelated features in the images.

    Doubling the size of your data set is good for obvious reasons. But by showing inputs that contain a dog without labelling that dog, you are essentially telling your algorithm that the image contains no dog [3], which is false. You are changing the patterns you are asking the algorithm to detect from those which separate cat/dog vs. no-cat/dog and cat vs. dog to those which separate labelled dogs from unlabelled dogs, which are not helpful features for your task.

    Lastly, by failing to label half of the dogs, your algorithm will learn to discriminate between those dogs which are labelled and those which are not. This means that instead of learning features common to dogs, it will learn features that separate the dogs in the labelled images from those in the unlabelled images. These could be background features in the images, or small generalisations which, by chance, appear more strongly in the labelled dogs than in the unlabelled ones.
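    As a concrete illustration of point 2, a single training example should carry a box for every object of interest it contains. The sketch below builds one tf.train.Example with both a dog box and a cat box, using feature keys in the style the Object Detection API's TFRecord format commonly expects; the file name, image size and coordinates are made up for illustration.

```python
import tensorflow as tf

# Hypothetical camera frame; in practice this comes from your own data.
with open("frame_0001.jpg", "rb") as f:
    encoded_jpg = f.read()

def float_list(values):
    return tf.train.Feature(float_list=tf.train.FloatList(value=values))

def bytes_list(values):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))

def int64_list(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

# One example, two boxes: the dog AND the sleeping cat (normalised [0, 1] coords).
example = tf.train.Example(features=tf.train.Features(feature={
    "image/encoded":            bytes_list([encoded_jpg]),
    "image/format":             bytes_list([b"jpeg"]),
    "image/height":             int64_list([720]),
    "image/width":              int64_list([1280]),
    "image/object/bbox/xmin":   float_list([0.10, 0.62]),
    "image/object/bbox/ymin":   float_list([0.30, 0.55]),
    "image/object/bbox/xmax":   float_list([0.35, 0.80]),
    "image/object/bbox/ymax":   float_list([0.70, 0.75]),
    "image/object/class/text":  bytes_list([b"dog", b"cat"]),
    "image/object/class/label": int64_list([1, 2]),
}))

with tf.io.TFRecordWriter("train.record") as writer:
    writer.write(example.SerializeToString())
```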


    3- This question is a little more difficult, and there is no easy solution to your problem here. Your model can only learn features to which it is exposed during training, which means that if you only show it one image of a cat (or several images in which the representation of the cat is identical), it will learn features specific to that one image. This quickly leads to the common problem of over-fitting, where your model learns features that are specific to your training examples and do not generalise well to other instances of cats.

    It would not be sufficient to crop out the cat during training and then simply include the cat in the eval data set, because you would be asking the model to detect features to which it has not been exposed during training and which it therefore has not learned.

    You want to include your labelled cat in every instance in which it appears in your data, and regularise your network to limit over-fitting. In addition, when data is scarce it is often beneficial to use pre-training to learn cat-specific features from unlabelled data before training, and/or to use data augmentation to artificially increase the diversity of your data (a rough sketch follows below).

    These suggestions are likely to improve your results, but the reality is that sourcing large, diverse data sets which comprehensively cover the features that are key to identifying your object is a major part of building a successful deep learning model. It depends on how uniform the instances of the cat are in your data, but if your model has only ever seen a cat from the front, it's not going to recognise a cat from the back.
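    As a rough sketch of the augmentation suggestion, simple image-level transforms can add some variety to the near-identical sleeping-cat examples. This is an illustration only: in a real detection pipeline the ground-truth boxes must be transformed together with the image, and the Object Detection API's pipeline config offers built-in augmentation options (such as random horizontal flips) that handle the boxes for you.

```python
import tensorflow as tf

# Illustration only: simple augmentation of an RGB image.
# Note that the horizontal flip also requires mirroring the box x-coordinates,
# which is omitted here.
def augment(image):
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.1)
    image = tf.image.random_contrast(image, lower=0.9, upper=1.1)
    image = tf.image.random_saturation(image, lower=0.9, upper=1.1)
    return image
```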


    TLDR:

    • 1 YES, shuffle them.
    • 2 YES, label them all.
    • 3 Get better data. Or: pretrain, regularise and augment data.

    [1] This does depend on the size of the batches in which you feed your data into the model.

    [2] This is based on my own intuition and I am happy to be corrected here.

    [3] This depends to some extent on how your loss function handles images in which there is no dog.