
Add dense layer on top of Huggingface BERT model


I want to add a dense layer on top of the bare BERT model (the transformer that outputs raw hidden states) and then fine-tune the resulting model. Specifically, I am using this base model. This is what the model should do:

  1. Encode the sentence (a vector with 768 elements for each token of the sentence)
  2. Keep only the first vector (related to the first token)
  3. Add a dense layer on top of this vector, to get the desired transformation

So far, I have successfully encoded the sentences:

from sklearn.neural_network import MLPRegressor
import torch
from transformers import AutoModel, AutoTokenizer

# List of strings
sentences = [...]
# List of numbers
labels = [...]

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
model = AutoModel.from_pretrained("dbmdz/bert-base-italian-xxl-cased")

# 2D array, one row per sentence, containing the embedding of the first ([CLS]) token
encoded_sentences = torch.stack([model(**tokenizer(s, return_tensors='pt'))[0][0][0]
                                 for s in sentences]).detach().numpy()

regr = MLPRegressor()
regr.fit(encoded_sentences, labels)

In this way I can train a neural network by feeding it the encoded sentences. However, this approach clearly does not fine-tune the base BERT model. Can anybody help me? How can I build a model (possibly in PyTorch or using the Huggingface library) that can be fine-tuned end to end?


Solution

  • There are two ways to do it. Since you are looking to fine-tune the model for a downstream task similar to classification, you can directly use:

    the BertForSequenceClassification class, which puts a single linear classification layer on top of the 768-dimensional pooled output and fine-tunes the whole model together with that head (see the sketch below).
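
    A minimal sketch of this route, assuming transformers v4+ (the example sentence, the label, and num_labels=3 are placeholders for illustration):

    import torch
    from transformers import AutoTokenizer, BertForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
    # num_labels is a placeholder; set it to the number of classes in your task
    model = BertForSequenceClassification.from_pretrained(
        "dbmdz/bert-base-italian-xxl-cased", num_labels=3)

    encoding = tokenizer(["Una frase di esempio"], return_tensors="pt", padding=True)
    outputs = model(**encoding, labels=torch.tensor([1]))
    loss = outputs.loss      # cross-entropy loss, ready for loss.backward() in a training loop
    logits = outputs.logits  # shape: (batch_size, num_labels)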

    Alternatively, you can define a custom module that creates a BERT model from the pre-trained weights and adds your own layers on top of it:

    import torch
    import torch.nn as nn
    from transformers import AutoTokenizer, BertModel

    class CustomBERTModel(nn.Module):
        def __init__(self):
            super(CustomBERTModel, self).__init__()
            self.bert = BertModel.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
            ### New layers:
            self.linear1 = nn.Linear(768, 256)
            self.linear2 = nn.Linear(256, 3) ## 3 is the number of classes in this example

        def forward(self, ids, mask):
            bert_output = self.bert(ids, attention_mask=mask)
            # last_hidden_state has shape (batch_size, sequence_length, 768);
            # with transformers < 4.0 the model returns a tuple instead, so index [0] there
            sequence_output = bert_output.last_hidden_state

            # extract the embedding of the first ([CLS]) token of every sentence
            linear1_output = self.linear1(sequence_output[:, 0, :])

            linear2_output = self.linear2(linear1_output)

            return linear2_output
    
    tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
    model = CustomBERTModel() # you can pass parameters if required, to have a more flexible model
    device = torch.device("cpu") ## can be "cuda" if a GPU is available
    model.to(device)
    criterion = nn.CrossEntropyLoss() ## if required, define your own criterion
    optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()))

    for epoch in range(epochs):
        for batch in data_loader: ## assuming a DataLoader() object that yields (data, targets) tuples

            data = batch[0]
            targets = batch[1].to(device)

            optimizer.zero_grad()
            encoding = tokenizer.batch_encode_plus(data, return_tensors='pt', padding=True,
                                                   truncation=True, max_length=50,
                                                   add_special_tokens=True)
            input_ids = encoding['input_ids'].to(device)
            attention_mask = encoding['attention_mask'].to(device)

            # CrossEntropyLoss expects raw logits, so no softmax/log_softmax is applied here
            outputs = model(input_ids, mask=attention_mask)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
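
    After training, a quick sanity check could look like this (a minimal sketch under the same assumptions; the example sentence is a placeholder):

    model.eval()
    with torch.no_grad():
        encoding = tokenizer("Una frase di esempio", return_tensors='pt',
                             padding=True, truncation=True, max_length=50)
        logits = model(encoding['input_ids'].to(device),
                       mask=encoding['attention_mask'].to(device))
        predicted_class = torch.argmax(logits, dim=1)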