I have a corpus of synonyms and non-synonyms. These are stored in a list of Python dictionaries like {"sentence1": <string>, "sentence2": <string>, "label": <1.0 or 0.0>}. Note that these words (or sentences) do not have to be a single token in the tokenizer.
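For illustration, a couple of entries from my corpus look roughly like this (the strings here are made up):

data = [
    {"sentence1": "car", "sentence2": "automobile", "label": 1.0},
    {"sentence1": "car", "sentence2": "banana", "label": 0.0},
]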
I want to fine-tune a BERT-based model to take both sentences as input, like: [[CLS], <sentence1_token1>, ..., <sentence1_tokenN>, [SEP], <sentence2_token1>, ..., <sentence2_tokenM>, [SEP]]
and predict the "label" (a value between 0.0 and 1.0).
What is the best way to organize this data to facilitate fine-tuning of a Hugging Face transformer?
You can use the tokenizer's __call__ method to join both sentences when encoding them.
In case you're using the PyTorch implementation, here is an example:
import torch
from transformers import AutoTokenizer

sentences1 = ...  # List containing all first sentences
sentences2 = ...  # List containing all second sentences
labels = ...      # List containing all labels (1.0 or 0.0)

TOKENIZER_NAME = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME)

# Passing both lists encodes each pair as [CLS] sentence1 [SEP] sentence2 [SEP].
# padding/truncation is needed so every example has the same length and the
# batch can be returned as a single tensor.
encodings = tokenizer(
    sentences1,
    sentences2,
    padding=True,
    truncation=True,
    return_tensors="pt",
)
labels = torch.tensor(labels)
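As a quick sanity check, you can decode one encoded pair to verify that both sentences were joined into a single [CLS] ... [SEP] ... [SEP] sequence (the output shown in the comment is just an illustration based on the made-up pairs above):

# BERT encodings contain input_ids, token_type_ids and attention_mask;
# token_type_ids marks which tokens belong to sentence1 vs sentence2.
print(encodings.keys())
print(tokenizer.decode(encodings["input_ids"][0]))
# e.g. "[CLS] car [SEP] automobile [SEP] [PAD] ..."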
Then you can create a custom Dataset to use during training:
class CustomRealDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Return one example as a dict of tensors, which is the format
        # expected by the Trainer / a standard training loop.
        item = {key: value[idx] for key, value in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)
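With that Dataset in place, a minimal fine-tuning sketch could look like the following (I'm assuming AutoModelForSequenceClassification with num_labels=1 so the single output score is trained with a regression loss against your 0.0-1.0 labels; the training arguments are just placeholders):

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

dataset = CustomRealDataset(encodings, labels.float())

# num_labels=1 gives a single output score; with float labels the model
# uses a regression (MSE) loss, which matches a 0.0-1.0 target.
model = AutoModelForSequenceClassification.from_pretrained(TOKENIZER_NAME, num_labels=1)

training_args = TrainingArguments(
    output_dir="./results",          # placeholder path
    num_train_epochs=3,              # placeholder hyperparameters
    per_device_train_batch_size=16,
)

trainer = Trainer(model=model, args=training_args, train_dataset=dataset)
trainer.train()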