The BERT model for language modeling and sequence classification includes an extra projection layer between the last transformer layer and the classification layer (it consists of a linear layer of size hidden_dim x hidden_dim, a dropout layer and a tanh activation). This was not described in the original paper but was clarified here. This intermediate layer is pre-trained together with the rest of the transformer layers.
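For illustration, here is a minimal sketch of such a projection, modeled on how Hugging Face implements it (note that in their code the dropout mentioned above is applied afterwards, in the classification head, rather than inside the pooler itself):

```python
import torch
from torch import nn

class Pooler(nn.Module):
    """Projection applied to the [CLS] hidden state before the classifier."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.dense = nn.Linear(hidden_dim, hidden_dim)  # hidden_dim x hidden_dim
        self.activation = nn.Tanh()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim); keep only the first ([CLS]) token
        return self.activation(self.dense(hidden_states[:, 0]))

pooled = Pooler(hidden_dim=768)(torch.randn(2, 10, 768))
print(pooled.shape)  # torch.Size([2, 768])
```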
In Hugging Face's BertModel, this layer is called pooler.
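You can see it by loading any BERT checkpoint and printing that attribute (the output below is approximate):

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
print(model.pooler)
# BertPooler(
#   (dense): Linear(in_features=768, out_features=768, bias=True)
#   (activation): Tanh()
# )
```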
According to the paper, the FlauBERT model (an XLMModel trained on a French corpus) also includes this pooler layer: "The classification head is composed of the following layers, in order: dropout, linear, tanh activation, dropout, and linear." However, when loading a FlauBERT model with Hugging Face (e.g., with FlaubertModel.from_pretrained(...) or FlaubertForSequenceClassification.from_pretrained(...)), the model seems to include no such layer.
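For example (the checkpoint name here is just one of the published FlauBERT models):

```python
from transformers import FlaubertModel

model = FlaubertModel.from_pretrained("flaubert/flaubert_base_cased")

# Unlike BertModel, no submodule named "pooler" is present
print(hasattr(model, "pooler"))                                    # False
print(any("pooler" in name for name, _ in model.named_modules()))  # False
```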
Hence the question: why is there no pooler layer in Hugging Face's FlauBERT model?
The pooler is necessary for the next sentence prediction task. This task was removed from FlauBERT's training, making the pooler an optional layer. Hugging Face commented that the "pooler's output is usually not a good summary of the semantic content of the input, you're often better with averaging or pooling the sequence of hidden-states for the whole input sequence". Thus I believe they decided to remove the layer.
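Following that suggestion, here is a minimal sketch of mean pooling over the hidden states to get a sentence representation (the checkpoint name is just an example):

```python
import torch
from transformers import FlaubertModel, FlaubertTokenizer

tokenizer = FlaubertTokenizer.from_pretrained("flaubert/flaubert_base_cased")
model = FlaubertModel.from_pretrained("flaubert/flaubert_base_cased")

inputs = tokenizer(["Bonjour le monde !"], return_tensors="pt", padding=True)
with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state  # (batch, seq_len, hidden_dim)

# Average the hidden states of non-padding tokens as the sentence embedding
mask = inputs["attention_mask"].unsqueeze(-1)  # (batch, seq_len, 1)
sentence_embedding = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, hidden_dim])
```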