Tags: python, huggingface-transformers

Organize data for transformer fine-tuning


I have a corpus of synonyms and non-synonyms. These are stored in a list of Python dictionaries like {"sentence1": <string>, "sentence2": <string>, "label": <1.0 or 0.0>}. Note that these words (or sentences) do not have to be a single token in the tokenizer.
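For illustration, one entry might look like this (hypothetical values):

    {
        "sentence1": "The car is fast.",
        "sentence2": "The automobile is quick.",
        "label": 1.0,
    }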

I want to fine-tune a BERT-based model to take both sentences as input, like: [CLS], <sentence1_token1>, ..., <sentence1_tokenN>, [SEP], <sentence2_token1>, ..., <sentence2_tokenM>, [SEP], and predict the "label" (a value between 0.0 and 1.0).

What is the best approach to organize this data to facilitate fine-tuning of the Hugging Face transformer?


Solution

  • You can use the tokenizer's __call__ method to join both sentences when encoding them.

    If you're using the PyTorch implementation, here is an example:

    import torch
    from transformers import AutoTokenizer
    
    sentences1 = ... # List containing all sentences 1
    sentences2 = ... # List containing all sentences 2
    labels = ... # List containing all labels (0 or 1)
    
    TOKENIZER_NAME = "bert-base-cased"
    tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME)
    
    # Passing the two lists together makes the tokenizer build
    # [CLS] sentence1 [SEP] sentence2 [SEP] pairs for you.
    # padding/truncation are needed so pairs of different lengths
    # can be stacked into a single tensor.
    encodings = tokenizer(
        sentences1,
        sentences2,
        padding=True,
        truncation=True,
        return_tensors="pt"
    )
    
    labels = torch.tensor(labels)
    

    Then you can create a custom Dataset to use during training:

    class CustomRealDataset(torch.utils.data.Dataset):
        def __init__(self, encodings, labels):
            self.encodings = encodings
            self.labels = labels
    
        def __getitem__(self, idx):
            # Return one example as a dict of tensors (input_ids, attention_mask, ...)
            # plus its label under the "labels" key expected by the Trainer.
            item = {key: value[idx] for key, value in self.encodings.items()}
            item["labels"] = self.labels[idx]
            return item
    
        def __len__(self):
            return len(self.labels)
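
    To complete the picture, here is a minimal sketch of how this Dataset could be wired into the Trainer API. The model head, num_labels choice, and training arguments below are assumptions for illustration, not part of the original answer:

    from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
    
    # Assumption: integer synonym/non-synonym labels (0/1), hence num_labels=2.
    # For a continuous score between 0.0 and 1.0, num_labels=1 (regression head)
    # with float labels is an alternative.
    model = AutoModelForSequenceClassification.from_pretrained(TOKENIZER_NAME, num_labels=2)
    
    train_dataset = CustomRealDataset(encodings, labels)
    
    training_args = TrainingArguments(
        output_dir="./results",           # where checkpoints are written (assumed path)
        num_train_epochs=3,               # illustrative hyperparameters
        per_device_train_batch_size=16,
    )
    
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
    )
    trainer.train()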