Tags: python, huggingface-transformers

Organize data for transformer fine-tuning


I have a corpus of synonyms and non-synonyms. These are stored in a list of Python dictionaries like {"sentence1": <string>, "sentence2": <string>, "label": <1.0 or 0.0>}. Note that these words (or sentences) do not have to be a single token in the tokenizer.
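For illustration, one entry might look like this (hypothetical values):

    {
        "sentence1": "The car is fast.",
        "sentence2": "The automobile is quick.",
        "label": 1.0,
    }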

I want to fine-tune a BERT-based model to take both sentences as input, like: [CLS], <sentence1_token1>, ..., <sentence1_tokenN>, [SEP], <sentence2_token1>, ..., <sentence2_tokenM>, [SEP], and predict the "label" (a value between 0.0 and 1.0).

What is the best approach to organize this data to facilitate fine-tuning of the Hugging Face transformer?


Solution

  • You can use the tokenizer's __call__ method to join both sentences when encoding them.

    If you're using the PyTorch implementation, here is an example:

    import torch
    from transformers import AutoTokenizer
    
    sentences1 = ... # List containing all sentences 1
    sentences2 = ... # List containing all sentences 2
    labels = ... # List containing all labels (0 or 1)
    
    TOKENIZER_NAME = "bert-base-cased"
    tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME)
    
    # Passing the two lists together makes the tokenizer build
    # [CLS] sentence1 [SEP] sentence2 [SEP] pairs for you.
    # padding/truncation are needed so pairs of different lengths
    # can be stacked into a single tensor.
    encodings = tokenizer(
        sentences1,
        sentences2,
        padding=True,
        truncation=True,
        return_tensors="pt"
    )
    
    labels = torch.tensor(labels)
    

    Then you can create a custom Dataset to use during training:

    class CustomRealDataset(torch.utils.data.Dataset):
        def __init__(self, encodings, labels):
            self.encodings = encodings
            self.labels = labels
    
        def __getitem__(self, idx):
            # Return one example as a dict of tensors (input_ids, attention_mask, ...)
            # plus its label under the "labels" key expected by the Trainer.
            item = {key: value[idx] for key, value in self.encodings.items()}
            item["labels"] = self.labels[idx]
            return item
    
        def __len__(self):
            return len(self.labels)
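
    To complete the picture, here is a minimal sketch of how this Dataset could be wired into the Trainer API. The model head, num_labels choice, and training arguments below are assumptions for illustration, not part of the original answer:

    from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
    
    # Assumption: integer synonym/non-synonym labels (0/1), hence num_labels=2.
    # For a continuous score between 0.0 and 1.0, num_labels=1 (regression head)
    # with float labels is an alternative.
    model = AutoModelForSequenceClassification.from_pretrained(TOKENIZER_NAME, num_labels=2)
    
    train_dataset = CustomRealDataset(encodings, labels)
    
    training_args = TrainingArguments(
        output_dir="./results",           # where checkpoints are written (assumed path)
        num_train_epochs=3,               # illustrative hyperparameters
        per_device_train_batch_size=16,
    )
    
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
    )
    trainer.train()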