I am a bit confused about how to consume the outputs of Hugging Face transformers to train a simple binary classifier that predicts whether Albert Einstein said a sentence or not.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["Hello World", "Hello There", "Bye Bye", "Two things are infinite: the universe and human stupidity; and I'm not sure about the universe."]
for sentence in sentences:
    # Tokenize one sentence at a time, so no padding is applied
    encoded = tokenizer(sentence, return_tensors="pt")
    outputs = model(**encoded)
    # outputs[0] is the last hidden state: (batch, num_tokens, 768)
    print(outputs[0].shape, sentence, len(sentence))
Output:
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
torch.Size([1, 4, 768]) Hello World 11
torch.Size([1, 4, 768]) Hello There 11
torch.Size([1, 4, 768]) Bye Bye 7
torch.Size([1, 23, 768]) Two things are infinite: the universe and human stupidity; and I'm not sure about the universe. 95
As you can see, the dimensions of the output vary with the length of the input. Now assume I would like to train a binary classifier that predicts whether Einstein said the input sentence, and that the input to the network is the output of the BERT transformer.
How could I write a CNN model in PyTorch that takes a tensor of shape [1, None, 768]? The second dimension changes with the length of the input.
In PyTorch you don't need a fixed input dimension for a CNN. The only requirement is that your kernel_size must not be larger than the input size (after any padding).
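To make the point about variable-length inputs concrete, here is a minimal sketch of a 1D CNN head. The class name `CnnSentenceClassifier`, the channel count, and the kernel size are illustrative assumptions, not something from the question, and random tensors stand in for the BERT outputs.

```python
import torch
from torch import nn


class CnnSentenceClassifier(nn.Module):
    """1D CNN over token embeddings, followed by global max pooling."""

    def __init__(self, hidden_size: int = 768, channels: int = 64, kernel_size: int = 3):
        super().__init__()
        # padding keeps the sequence length unchanged, so even inputs
        # shorter than kernel_size are accepted
        self.conv = nn.Conv1d(hidden_size, channels, kernel_size, padding=kernel_size // 2)
        self.pool = nn.AdaptiveMaxPool1d(1)  # collapses any sequence length to 1
        self.fc = nn.Linear(channels, 1)     # one logit: "Einstein" vs. not

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size); Conv1d expects channels first
        x = hidden_states.transpose(1, 2)   # (batch, hidden_size, seq_len)
        x = torch.relu(self.conv(x))        # (batch, channels, seq_len)
        x = self.pool(x).squeeze(-1)        # (batch, channels)
        return self.fc(x)                   # (batch, 1)


cnn = CnnSentenceClassifier()
# Random stand-ins for BERT outputs of 4 and 23 tokens
short_logits = cnn(torch.randn(1, 4, 768))
long_logits = cnn(torch.randn(1, 23, 768))
print(short_logits.shape, long_logits.shape)
```

The adaptive pooling layer is what removes the dependence on sequence length: whatever the second dimension is, the tensor fed to the linear layer has a fixed size.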
Generally, the best way of putting a classifier (sequence classifier) on top of a Transformer model is to add a pooling layer followed by a fully connected (FC) layer. You can use global pooling (average or max) or adaptive pooling, and then a fully connected layer.
Note that you can also use AutoModelForSequenceClassification to get everything done for you.
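As a rough sketch of the AutoModelForSequenceClassification route, the snippet below builds a classifier with a single output logit. To keep it runnable offline, it assumes a tiny, randomly initialised `BertConfig` and random token ids instead of the real checkpoint and tokenizer; the config sizes are arbitrary, and the `.logits` attribute assumes transformers v4 or later.

```python
import torch
from transformers import AutoModelForSequenceClassification, BertConfig

# Assumption: a tiny random BERT so the sketch needs no download.
config = BertConfig(
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=1,  # a single logit for the binary decision
)
clf = AutoModelForSequenceClassification.from_config(config)
clf.eval()

# Random stand-ins for a tokenized batch of 4 sentences padded to 30 tokens
input_ids = torch.randint(0, config.vocab_size, (4, 30))
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    logits = clf(input_ids=input_ids, attention_mask=attention_mask).logits
print(logits.shape)
```

With the real checkpoint you would instead call `AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=1)` and pass the tokenizer output directly; the classification head is then randomly initialised and still has to be trained.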
# An example with simple average pooling
from transformers import AutoTokenizer, AutoModel
import torch

NUM_CLASSES = 1
MAX_LEN = 30

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
# Linear head on top of the encoder's hidden size (768 for bert-base)
model.classifier = torch.nn.Linear(model.pooler.dense.in_features, NUM_CLASSES)

inputs_str = ["Hello World", "Hello There", "Bye Bye", "Two things are infinite: the universe and human stupidity; and I'm not sure about the universe."]
# Pad (and truncate) every sentence to MAX_LEN so the batch forms one tensor
inputs = tokenizer(inputs_str, padding="max_length", truncation=True, return_tensors="pt", max_length=MAX_LEN)
# Index 0 is the last hidden state: (batch, MAX_LEN, 768)
hidden = model(**inputs)[0]
# Average over the token dimension (padding tokens are included in this mean)
pooled = torch.mean(hidden, dim=1)
logits = model.classifier(pooled)
print(logits.shape)  # => torch.Size([4, 1])
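One caveat with a plain mean over padded inputs is that the padding positions contribute to the average. A possible refinement, sketched here under the assumption that you have the tokenizer's `attention_mask` at hand, is a masked mean. The helper name `masked_mean` is hypothetical, and random tensors stand in for the BERT outputs.

```python
import torch


def masked_mean(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors, ignoring positions where attention_mask is 0."""
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                      # avoid division by zero
    return summed / counts


# Stand-in batch: 2 sentences padded to length 5, hidden size 768
hidden = torch.randn(2, 5, 768)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
pooled = masked_mean(hidden, mask)
print(pooled.shape)
```

In the example above this would replace the `torch.mean(..., dim=1)` call, with `inputs["attention_mask"]` passed as the mask.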