import torch.nn as nn

class BERTPooler(nn.Module):
    def __init__(self, config):
        super(BERTPooler, self).__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output
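For context, here is a quick usage sketch (the SimpleNamespace config is just a stand-in for the real BertConfig): the pooler takes the encoder output of shape (batch_size, seq_len, hidden_size) and returns one vector per sequence, built from the first ([CLS]) token's hidden state.

import torch
from types import SimpleNamespace

config = SimpleNamespace(hidden_size=768)  # stand-in for the real BertConfig
pooler = BERTPooler(config)

# Fake encoder output: batch of 2 sequences, 128 tokens, 768-dim hidden states.
hidden_states = torch.randn(2, 128, config.hidden_size)

pooled = pooler(hidden_states)
print(pooled.shape)  # torch.Size([2, 768]) -- one vector per sequence, from the [CLS] position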
The author of the original BERT paper answered this (kind of) in a comment on GitHub:

"The tanh() thing was done early to try to make it more interpretable but it probably doesn't matter either way."
I agree this doesn't fully answer whether tanh is preferable, but from the looks of it, the pooler will probably work with any reasonable activation.
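To illustrate that last point, here is a sketch of the same pooler with the activation injected instead of hard-coded (the activation argument is my own addition, not part of the original BERT code); I haven't benchmarked the alternatives, but something like GELU should behave similarly in practice.

import torch
import torch.nn as nn

class ConfigurablePooler(nn.Module):
    """Same idea as BERTPooler, but the activation is passed in rather than fixed to tanh."""
    def __init__(self, hidden_size, activation=None):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.activation = activation if activation is not None else nn.Tanh()

    def forward(self, hidden_states):
        # Still pools by taking the hidden state of the first ([CLS]) token.
        first_token_tensor = hidden_states[:, 0]
        return self.activation(self.dense(first_token_tensor))

# Swapping in GELU instead of tanh:
pooler = ConfigurablePooler(hidden_size=768, activation=nn.GELU())
hidden_states = torch.randn(2, 128, 768)
print(pooler(hidden_states).shape)  # torch.Size([2, 768])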