What is the loss function of the Mask RCNN?

The paper clearly states that the classification and regression losses are identical to those of the RPN in Faster R-CNN. Can someone explain the mask loss function? How do they use an FCN to improve on it?

Solution

FCN uses a per-pixel softmax and a multinomial cross-entropy loss. This means that the mask prediction task (the boundaries of the object) and the class prediction task (what the masked object is) are coupled: every pixel has to pick exactly one class, so the classes compete with each other.
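Mask R-CNN decouples these tasks: the existing box head (the branch that also does bounding-box regression, AKA the localization task) predicts the class, as in Faster R-CNN, and the mask branch generates a binary mask for each class independently, without competition among classes (e.g. if you have 21 classes, the mask branch predicts 21 separate binary masks instead of FCN's single mask with 21 competing channels). The mask loss is a per-pixel sigmoid followed by an average binary cross-entropy loss, and for an RoI with ground-truth class k it is computed only on the k-th mask; the other masks do not contribute. The total loss per RoI is L = L_cls + L_box + L_mask.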
Bottom line: it's a per-pixel sigmoid (independent binary masks) in Mask R-CNN vs. a per-pixel softmax (one multinomial mask) in FCN.
(See Table 2b in the Mask R-CNN paper, in the ablation section.)
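
To make the difference concrete, here is a minimal pure-Python sketch of the two mask losses. It is an illustration only, not the paper's or any library's implementation: the function names (`mask_bce_loss`, `softmax_mask_loss`), the nested-list layout `[K][H][W]` for the mask logits, and the toy numbers are all assumptions made for this example.

```python
import math


def mask_bce_loss(logits, gt_mask, gt_class):
    """Mask R-CNN style (sketch): per-pixel sigmoid + average binary
    cross-entropy, computed ONLY on the channel of the ground-truth class.

    logits:   [K][H][W] raw mask scores, one channel per class
    gt_mask:  [H][W] binary ground-truth mask (0 or 1)
    gt_class: index k of the RoI's ground-truth class
    """
    channel = logits[gt_class]  # the other K-1 channels are ignored
    total, count = 0.0, 0
    for row_z, row_y in zip(channel, gt_mask):
        for z, y in zip(row_z, row_y):
            # Numerically stable form of -[y*log(s) + (1-y)*log(1-s)],
            # where s = sigmoid(z).
            total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count


def softmax_mask_loss(logits, gt_labels):
    """FCN style (sketch): per-pixel softmax over all K channels +
    multinomial cross-entropy, so the channels compete at every pixel.

    logits:    [K][H][W] raw scores
    gt_labels: [H][W] class index per pixel
    """
    num_classes = len(logits)
    height, width = len(gt_labels), len(gt_labels[0])
    total = 0.0
    for i in range(height):
        for j in range(width):
            scores = [logits[k][i][j] for k in range(num_classes)]
            m = max(scores)  # subtract the max for a stable log-sum-exp
            log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
            total += log_sum - scores[gt_labels[i][j]]
    return total / (height * width)


# Toy example: K = 3 classes, 2x2 mask, ground-truth class k = 1.
logits = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[2.0, -1.0], [-2.0, 3.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]
gt_mask = [[1, 0], [0, 1]]    # binary mask for class 1
gt_labels = [[1, 0], [0, 1]]  # same pixels as class indices (0 = background)

# Raise the scores of an unrelated class (channel 2) everywhere.
perturbed = [logits[0], logits[1], [[5.0, 5.0], [5.0, 5.0]]]

print("sigmoid/BCE:", mask_bce_loss(logits, gt_mask, 1),
      "->", mask_bce_loss(perturbed, gt_mask, 1))
print("softmax/CE: ", softmax_mask_loss(logits, gt_labels),
      "->", softmax_mask_loss(perturbed, gt_labels))
```

In this sketch, changing the scores of another class leaves the sigmoid/BCE loss untouched, because only channel k is read, while the softmax loss changes, because all channels are normalized together at each pixel. That is the decoupling described above.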