The BERT model for language modeling and sequence classification includes an extra projection layer between the last transformer layer and the classification layer (it consists of a linear layer of size hidden_dim x hidden_dim, a dropout layer and a tanh activation). This was not described in the original paper but was clarified here. This intermediate layer is pre-trained together with the rest of the transformer layers.
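For illustration, here is a minimal sketch of such a projection, modeled on how Hugging Face implements it (note that in their code the dropout mentioned above is applied afterwards, in the classification head, rather than inside the pooler itself):

```python
import torch
from torch import nn

class Pooler(nn.Module):
    """Projection applied to the [CLS] hidden state before the classifier."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.dense = nn.Linear(hidden_dim, hidden_dim)  # hidden_dim x hidden_dim
        self.activation = nn.Tanh()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim); keep only the first ([CLS]) token
        return self.activation(self.dense(hidden_states[:, 0]))

pooled = Pooler(hidden_dim=768)(torch.randn(2, 10, 768))
print(pooled.shape)  # torch.Size([2, 768])
```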
In Hugging Face's BertModel, this layer is called pooler.
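You can see it by loading any BERT checkpoint and printing that attribute (the output below is approximate):

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
print(model.pooler)
# BertPooler(
#   (dense): Linear(in_features=768, out_features=768, bias=True)
#   (activation): Tanh()
# )
```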
According to the paper, the FlauBERT model (an XLMModel trained on a French corpus) also includes this pooler layer: "The classification head is composed of the following layers, in order: dropout, linear, tanh activation, dropout, and linear." However, when loading a FlauBERT model with Hugging Face (e.g., with FlaubertModel.from_pretrained(...) or FlaubertForSequenceClassification.from_pretrained(...)), the model seems to include no such layer.
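For example (the checkpoint name here is just one of the published FlauBERT models):

```python
from transformers import FlaubertModel

model = FlaubertModel.from_pretrained("flaubert/flaubert_base_cased")

# Unlike BertModel, no submodule named "pooler" is present
print(hasattr(model, "pooler"))                                    # False
print(any("pooler" in name for name, _ in model.named_modules()))  # False
```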
Hence the question: why is there no pooler layer in Hugging Face's FlauBERT model?
The pooler is necessary for the next sentence prediction task. This task was removed from FlauBERT's training, making the pooler an optional layer. Hugging Face commented that the "pooler's output is usually not a good summary of the semantic content of the input, you're often better with averaging or pooling the sequence of hidden-states for the whole input sequence". Thus I believe they decided to remove the layer.
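Following that suggestion, here is a minimal sketch of mean pooling over the hidden states to get a sentence representation (the checkpoint name is just an example):

```python
import torch
from transformers import FlaubertModel, FlaubertTokenizer

tokenizer = FlaubertTokenizer.from_pretrained("flaubert/flaubert_base_cased")
model = FlaubertModel.from_pretrained("flaubert/flaubert_base_cased")

inputs = tokenizer(["Bonjour le monde !"], return_tensors="pt", padding=True)
with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state  # (batch, seq_len, hidden_dim)

# Average the hidden states of non-padding tokens as the sentence embedding
mask = inputs["attention_mask"].unsqueeze(-1)  # (batch, seq_len, 1)
sentence_embedding = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, hidden_dim])
```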