Tags: deep-learning, nlp, bert-language-model

Why does the BERT model keep 10% of the selected tokens unchanged instead of masking them?


I am reading the BERT paper. In the Masked Language Model task during pre-training, the model randomly selects 15% of the tokens. For each chosen token (Ti), 80% of the time it is replaced with the [MASK] token, 10% of the time Ti is left unchanged, and 10% of the time Ti is replaced with a random word. I think replacing with [MASK] or with another word should be enough. Why does the model also randomly keep some chosen tokens unchanged? And does pre-training predict only the [MASK] tokens, or all 15% of the selected tokens?
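For reference, the selection procedure described in the paper can be sketched roughly like this. This is a minimal, illustrative version in plain Python, not the actual BERT pre-processing code; the token list and vocabulary are made up:

```python
import random

def mask_tokens(tokens, vocab, select_prob=0.15):
    """Apply the paper's 80/10/10 corruption to a list of tokens.

    Returns the corrupted tokens and the positions the model must predict.
    """
    corrupted = list(tokens)
    prediction_positions = []
    for i, token in enumerate(tokens):
        if random.random() >= select_prob:
            continue                              # not selected: nothing to predict here
        prediction_positions.append(i)            # the model predicts ALL selected positions
        r = random.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"               # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = random.choice(vocab)   # 10%: replace with a random word
        # else (remaining 10%): keep the original token unchanged
    return corrupted, prediction_positions

tokens = ["the", "cat", "sat", "on", "the", "mat"]
vocab = ["dog", "ran", "under", "table", "cat", "mat"]
print(mask_tokens(tokens, vocab))
```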


Solution

  • This is done because they want to pre-train a bidirectional model. Most of the time the network sees a sentence with a [MASK] token, and it is trained to predict the word that is supposed to be there. But in fine-tuning, which is done after pre-training (fine-tuning is the training done by everyone who wants to use BERT on their own task), there are no [MASK] tokens (unless you specifically do masked LM).

    This mismatch between pre-training and fine-tuning (the sudden disappearance of the [MASK] token) is softened by not always replacing the chosen word with [MASK]: 20% of the selected tokens (10% kept unchanged, 10% replaced with a random word) still show a real word in the input. The task is still there, the network has to predict the token at every selected position, but for the unchanged tokens it already gets the answer as input. This might seem counterintuitive, but it makes sense when combined with the [MASK] training, as the sketch below shows.
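To make that last point concrete: the loss is computed at every one of the 15% selected positions, including those whose token was left unchanged; all other positions are ignored. Below is a hedged sketch of how the label vector is commonly built. The ignore value of -100 follows the usual PyTorch cross-entropy convention; this is illustrative, not the paper's exact code:

```python
IGNORE = -100  # positions with this label contribute nothing to the loss

def build_labels(original_ids, prediction_positions):
    """Label every selected position with its original token id.

    The model must recover the original word there, even if the input
    still shows that word unchanged; all other positions are ignored.
    """
    labels = [IGNORE] * len(original_ids)
    for i in prediction_positions:
        labels[i] = original_ids[i]
    return labels

# e.g. positions 1 and 4 were selected (say one was masked, one kept unchanged)
print(build_labels([12, 874, 33, 91, 7, 55], [1, 4]))  # [-100, 874, -100, -100, 7, -100]
```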