For example, in the BertForSequenceClassification class definition, only one Linear layer is used for the classifier. If just one Linear layer is used, doesn't it simply apply a linear projection to pooled_output? Will such a classifier produce good predictions? Why not use multiple Linear layers? Does transformers offer any option for using multiple Linear layers as the classification head?
I looked at several other classes. They all use a single Linear layer as the classification head.
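For reference, the part I'm looking at is roughly this (simplified from memory, so details may differ between transformers versions):

```python
import torch.nn as nn
from transformers import BertModel, BertPreTrainedModel

# Simplified excerpt of BertForSequenceClassification; see modeling_bert.py
# in transformers for the exact source.
class BertForSequenceClassification(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        # the entire classification head: a single Linear layer
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.bert(input_ids, attention_mask=attention_mask)
        pooled_output = self.dropout(outputs[1])  # pooled [CLS] representation
        logits = self.classifier(pooled_output)   # just a linear projection
        return logits
```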
To add to the previous answer:

The embedding layers (self.bert = BertModel(config) in your case) transform the original data (a sentence, an image, etc.) into a semantically meaningful vector space. This is where all the architectural design comes in (e.g. attention, CNNs, LSTMs), and those components are far better suited to their tasks than a simple FC layer. So if you have the capacity to add multiple FCs, why not just add another attention block instead? On the other hand, the embeddings from a decent model should have large inter-class distances and small intra-class variance, so they can easily be projected onto their corresponding classes in a linear fashion, and a single FC is more than enough.
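As far as I know there is no config option for a deeper head, but nothing stops you from swapping the classifier out yourself, since the forward pass just calls model.classifier on the pooled output. A minimal sketch (checkpoint name, layer sizes and activation are placeholder choices):

```python
import torch.nn as nn
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

hidden = model.config.hidden_size  # 768 for bert-base

# Replace the single Linear head with a small MLP.
model.classifier = nn.Sequential(
    nn.Linear(hidden, hidden),
    nn.ReLU(),
    nn.Dropout(model.config.hidden_dropout_prob),
    nn.Linear(hidden, model.config.num_labels),
)
```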
It would be ideal to have the pretrained portion be as big as possible, so that, as a downstream user, I only have to train/fine-tune a tiny part of the model (e.g. the FC classification layer).
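A rough sketch of that workflow, freezing the pretrained encoder and training only the head (checkpoint name, label count and learning rate are arbitrary here):

```python
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Freeze the pretrained encoder so only the classification head gets updated.
for param in model.bert.parameters():
    param.requires_grad = False

# Only hand the trainable parameters (the head) to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
```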