In BERT pretraining, the final hidden state at the [CLS] position is fed into a classifier for the Next Sentence Prediction task (or, in some BERT variants, for other tasks, such as ALBERT's Sentence Order Prediction); this helps pretrain the entire transformer, and it also leaves the [CLS] position readily available for fine-tuning on other "sentence-scale" tasks.
I wonder whether [SEP] could be retrained in the same manner. [CLS] will probably be easier to retrain, since the transformer is already trained to imbue its embedding with meaning from across the sentence, whereas [SEP] (one would assume) lacks these "connections"; still, this might work with sufficient fine-tuning.
With this, one could retrain the same model for two different classification tasks, one using [CLS] and one using [SEP].
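For concreteness, here is a rough sketch of what I have in mind, using Hugging Face transformers and PyTorch. The model name, head sizes, and the dual-head wrapper class are just illustrative assumptions on my part, not an established recipe: the idea is simply to pool the hidden state at the [SEP] position for a second classifier, alongside the usual [CLS] head.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast


class DualHeadBert(nn.Module):
    """Hypothetical wrapper: one classification head on [CLS], another on [SEP]."""

    def __init__(self, model_name="bert-base-uncased",
                 num_labels_cls=2, num_labels_sep=2):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size
        self.cls_head = nn.Linear(hidden, num_labels_cls)  # task A: [CLS] pooling
        self.sep_head = nn.Linear(hidden, num_labels_sep)  # task B: [SEP] pooling
        # Look up the id of [SEP] so we can locate it in each sequence.
        self.sep_token_id = BertTokenizerFast.from_pretrained(model_name).sep_token_id

    def forward(self, input_ids, attention_mask=None):
        hidden_states = self.bert(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                                 # (batch, seq_len, hidden)

        cls_vec = hidden_states[:, 0, :]                    # [CLS] is always position 0

        # Take the hidden state at the first [SEP] token of each sequence.
        sep_positions = (input_ids == self.sep_token_id).float().argmax(dim=1)
        sep_vec = hidden_states[torch.arange(input_ids.size(0)), sep_positions]

        return self.cls_head(cls_vec), self.sep_head(sep_vec)
```

One would then fine-tune the two heads on their respective tasks (jointly or separately), with the open question being whether the [SEP] position ever learns a useful sentence-level representation.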
Am I missing anything? Is there a reason why this would not work?
In theory it can give 'some' results, so it would work (it's just a token), but the question is why you would want to do that. These tokens have been pretrained for a specific purpose. I suppose that by 'retrain' you mean fine-tuning; if you suddenly fine-tune the [SEP] token as a classification token, I don't think you will get good results, because you are fine-tuning a single token position in the whole language model for a task it wasn't even pretrained for.