deep-learning · conv-neural-network · object-detection · mask · image-segmentation

Understanding MaskRCNN Annotation feed


I'm currently working on an Object Detection project using Matterport MaskRCNN.

Part of the job is to detect a green leaf that crosses a white grid. Until now I have defined the annotations (polygons) so that every single leaf which crosses the net (and gives a white-green-white pattern) is considered a valid annotation.

But when changing the definition above from single-cross annotation to multi-cross (more than one leaf crossing the net at once), I started to see a serious decrease in model performance during the testing phase.

This raised my question. The only difference between the two comes down to the size of the annotation. So:

Which of the following is more influential on learning during MaskRCNN's training - pattern or size?

If the pattern is what matters, that is better, because the goal is to identify a crossing. Conversely, if the size of the annotation is the main influence, that is a problem, because I don't want the model to look for multi-cross (or, alternatively, large single-cross) regions in the image.

P.S. - References to recommended articles that explain the subject are welcome.

Thanks in advance


Solution

  • If I understand correctly, the shape of the annotation becomes longer and more stretched out when going for multi-cross annotation.

    In that case you can change the size and aspect ratio of the anchors that scan the image for objects. With the default settings the model mostly proposes squarish bounding boxes, so very long and narrow annotations create bounding boxes with a large difference between width and height. Such objects seem to be harder for the model to detect and segment.

    These are the default configurations in the config.py file:

        # Length of square anchor side in pixels
        RPN_ANCHOR_SCALES = (32, 64, 128, 256, 512)

        # Ratios of anchors at each cell (width/height).
        # A value of 1 represents a square anchor, and 0.5 is a wide anchor
        RPN_ANCHOR_RATIOS = [0.5, 1, 2]

    You can play around with these values in inference mode and see whether they give you better results; a rough sketch of how such overrides usually look is shown below.
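
    As a sketch only (not part of the original answer): in the Matterport code base these values are normally overridden by subclassing mrcnn.config.Config rather than by editing config.py. The class name LeafConfig, the class count, and the extra 3:1 / 1:3 ratios below are illustrative assumptions, not values taken from the question.

        from mrcnn.config import Config

        class LeafConfig(Config):
            """Hypothetical config for the leaf-crossing dataset (names are assumptions)."""
            NAME = "leaf_crossing"      # illustrative name
            NUM_CLASSES = 1 + 1         # background + leaf crossing (assumption)

            # Length of square anchor side in pixels (Matterport defaults)
            RPN_ANCHOR_SCALES = (32, 64, 128, 256, 512)

            # Width/height ratios of anchors at each cell.
            # Adding 3 and 1/3 lets the RPN propose long, narrow boxes
            # that better match stretched multi-cross annotations.
            RPN_ANCHOR_RATIOS = [1 / 3, 0.5, 1, 2, 3]

        config = LeafConfig()
        config.display()  # print the full configuration to verify the overrides

    One caveat: RPN_ANCHOR_RATIOS determines how many anchors the RPN predicts per location, so if you change it, the training and inference configurations generally need to use the same value.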