Tags: deep-learning, image-segmentation

How do instance segmentation methods deal with partially labelled data?


Consider that I have a dataset of labelled images of cats and dogs. Say that only 50% of the cat instances in images that contain dogs are labelled, and that 50% of the images that contain dogs also contain cats. I want to detect each instance of a dog or cat with a bounding box, a mask and a score (i.e., I'm using Mask R-CNN).

The question here is: do the unlabelled instances of cats affect the classification/segmentation heads of the model? Is there a simple reason why or why not?

Most people I've asked claim that they have no effect on the quality of the model, but they aren't able to explain why this is the case. I think that a part of the problem here is that I've got more experience with "classical" models, where every data point contributes to the loss. I also think that I don't speak the language/jargon of deep learning well enough to know where to start looking for answers. Please help!


Solution

  • It depends on the formulation of the learning setup. If you treat each image as a collection of samples, i.e. the objects it contains, each with a bounding box and a label in {cat, dog}, you should be fine. In essence, you ask the model to generate a bounding box and a label, each of which is matched to a ground truth. A wrong bounding box or label generates an error signal by contrasting it with the corresponding truth, which gives you something to penalize your model for (or train it with). Missing labels are contrasted with nothing, which basically means you are underutilizing your images; that is not in itself a deal breaker.

    If, on the other hand, you generate all bounding boxes from a single image and penalize your model for producing "redundant" boxes, you run the risk of penalizing correct predictions that are simply unlabeled, which is indeed bad. If the number of penalized good predictions is comparable to the number of penalized bad ones, you would be doing more harm than good.

    Perhaps the way to go is to start by training only on images with a full and correct labeling, then move on to include the noisy ones, manually checking whether the "extra" predictions correspond to real but unlabeled objects and fixing the data as needed, à la human-in-the-loop training.
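The first formulation above can be made concrete with a toy sketch. This is plain Python, not Mask-RCNN's actual target-assignment code; the IoU threshold and the L1 box loss are illustrative assumptions. The point is that a prediction matched to no ground-truth box is simply skipped, so an unlabeled cat never generates an error signal:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def matched_loss(predictions, ground_truth, iou_thresh=0.5):
    """Sum an L1 box loss only over predictions matched to a ground-truth box.
    Unmatched predictions (e.g. a correctly detected but unlabeled cat)
    contribute nothing: missing labels underutilize the image, no more."""
    loss, n_matched = 0.0, 0
    for pred in predictions:
        # best ground-truth candidate for this prediction
        best = max(ground_truth, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) >= iou_thresh:
            loss += sum(abs(p - g) for p, g in zip(pred, best))
            n_matched += 1
    return loss, n_matched

# One labeled box; the second (correct) prediction has no label and is ignored.
loss, n = matched_loss([(0, 0, 10, 10), (20, 20, 30, 30)], [(0, 0, 10, 10)])
```

Here `loss` is 0.0 and `n` is 1: the unlabeled object neither helps nor hurts.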
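The human-in-the-loop step could look something like the following sketch. The function name, the score threshold and the IoU threshold are all assumptions for illustration: confident predictions with no ground-truth match are exactly the candidates for "real but unlabeled" objects worth sending to a human:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def flag_for_review(predictions, scores, ground_truth,
                    iou_thresh=0.5, score_thresh=0.8):
    """Return high-confidence predictions that match no ground-truth box:
    candidates for real-but-unlabeled objects a human should inspect and,
    if confirmed, add to the annotations."""
    return [pred
            for pred, score in zip(predictions, scores)
            if score >= score_thresh
            and not any(iou(pred, gt) >= iou_thresh for gt in ground_truth)]

# The confident second detection has no label, so it gets queued for review.
queue = flag_for_review([(0, 0, 10, 10), (20, 20, 30, 30)],
                        [0.9, 0.95],
                        [(0, 0, 10, 10)])
```

Here `queue` contains only `(20, 20, 30, 30)`; confirmed boxes get folded back into the dataset before the next training round.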