Tags: nlp, bert-language-model

Why does the pooler use tanh as its activation function in BERT, rather than GELU?


import torch.nn as nn

class BERTPooler(nn.Module):
    def __init__(self, config):
        super(BERTPooler, self).__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token ([CLS]).
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output
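
For concreteness, here is a minimal usage sketch (the toy `config` object and tensor shapes below are assumptions for illustration, not part of the original post): the pooler takes encoder output of shape `[batch_size, seq_len, hidden_size]` and returns one `[batch_size, hidden_size]` vector per sequence, built from the first ([CLS]) position.

import torch

# Hypothetical minimal config with just the field BERTPooler reads.
class ToyConfig:
    hidden_size = 768

config = ToyConfig()
pooler = BERTPooler(config)

# Fake encoder output: batch of 2 sequences, 128 tokens each, 768-dim hidden states.
hidden_states = torch.randn(2, 128, config.hidden_size)

pooled = pooler(hidden_states)
print(pooled.shape)  # torch.Size([2, 768]) -- one pooled vector per sequence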

Solution

  • The author of the original BERT paper answered it (kind of) in a comment on GitHub.

    The tanh() thing was done early to try to make it more interpretable but it probably doesn't matter either way.

    I agree this doesn't fully settle whether tanh is preferable, but from the looks of it, the pooler will probably work with any activation; see the sketch below.
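
To make that concrete, here is a hedged sketch (not the actual BERT implementation) of a pooler whose activation is configurable, so you could swap tanh for GELU and compare downstream fine-tuning results yourself. The class name and `activation` parameter are made up for illustration.

import torch.nn as nn

class ConfigurablePooler(nn.Module):
    """Same pooling scheme as BERTPooler, but with a pluggable activation."""
    def __init__(self, hidden_size, activation="tanh"):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        # tanh is BERT's original choice; GELU would match the encoder layers.
        self.activation = nn.Tanh() if activation == "tanh" else nn.GELU()

    def forward(self, hidden_states):
        # Pool by taking the hidden state of the first ([CLS]) token.
        first_token_tensor = hidden_states[:, 0]
        return self.activation(self.dense(first_token_tensor))

# e.g. pooler = ConfigurablePooler(768, activation="gelu")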