I am trying to generate a custom dataset:
from datasets import Dataset, ClassLabel, Value
features = ({
    "sentence1": Value("string"),  # String type for sentence1
    "sentence2": Value("string"),  # String type for sentence2
    "label": ClassLabel(names=["not_equivalent", "equivalent"]),  # ClassLabel definition
    "idx": Value("int32"),
})
custom_dataset = Dataset.from_dict(train_pairs)
custom_dataset = custom_dataset.cast(features)
custom_dataset
A sample from my train_pairs looks like this:
{'sentence1': "that 's far too tragic to merit such superficial treatment ",
'sentence2': "that 's far too tragic to merit such superficial treatment ",
**'label': <ClassLabel.not_equivalent: 0>**,
'idx': 5}
/usr/local/lib/python3.10/dist-packages/pyarrow/lib.cpython-310-x86_64-linux-gnu.so in string.from_py.__pyx_convert_string_from_py_std__in_string()
**TypeError**: expected bytes, int found
So I changed the label to a plain integer, but the error is the same:
{'sentence1': "that 's far too tragic to merit such superficial treatment ",
'sentence2': "that 's far too tragic to merit such superficial treatment ",
**'label': 0**,
'idx': 5}
/usr/local/lib/python3.10/dist-packages/pyarrow/lib.cpython-310-x86_64-linux-gnu.so in string.from_py.__pyx_convert_string_from_py_std__in_string()
**TypeError**: expected bytes, int found
I am trying to model my dataset as below. (data sample + feature info)
ClassLabel can read integers directly, so there is no need to pre-process the labels into a specific format, which it looks like you may have tried (<ClassLabel.not_equivalent: 0>).
Your features should be defined using the Features class from datasets, not a plain dict:
from datasets import ClassLabel, Features, Value
features = Features({
    "sentence1": Value("string"),  # String type for sentence1
    "sentence2": Value("string"),  # String type for sentence2
    "label": ClassLabel(names=["not_equivalent", "equivalent"]),  # ClassLabel definition
    "idx": Value("int32"),
})
When you pass in the train pairs, they should be a single dictionary whose values are lists, with one entry per example, so that the rows line up by index across the lists. (I used ChatGPT to generate a few extra examples, but it works with just one as well.)
train_pairs = {
    'sentence1': [
        "that 's far too tragic to merit such superficial treatment",
        "the quick brown fox jumps over the lazy dog",
        "I love machine learning and data science",
    ],
    'sentence2': [
        "that 's far too tragic to merit such superficial treatment",
        "a fast dark-colored fox leaps over a sleepy canine",
        "I love machine learning and data science",
    ],
    'label': [1, 0, 1],
    'idx': [0, 1, 2],
}
Running this gives me no errors, so it should be a reproducible way for you to fix your issue with the labels.
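If your data currently exists as one dict per example (as the sample in the question suggests), something like the sketch below could reshape it into the dict-of-lists layout described above. This is only an illustration under assumptions: PairLabel is a hypothetical stand-in for whatever enum produced <ClassLabel.not_equivalent: 0>, rows_to_columns is a made-up helper name, and every row is assumed to have the same keys.

```python
from enum import Enum


class PairLabel(Enum):
    # Hypothetical stand-in for the enum behind <ClassLabel.not_equivalent: 0>
    not_equivalent = 0
    equivalent = 1


def rows_to_columns(rows):
    """Turn a list of per-example dicts into one dict of lists (one list per column).

    Assumes every row has the same keys, so all lists end up the same length.
    """
    columns = {}
    for row in rows:
        for key, value in row.items():
            # Unwrap enum members so the label column holds plain integers
            if isinstance(value, Enum):
                value = value.value
            columns.setdefault(key, []).append(value)
    return columns


rows = [
    {"sentence1": "I love machine learning and data science",
     "sentence2": "I love machine learning and data science",
     "label": PairLabel.equivalent, "idx": 0},
    {"sentence1": "the quick brown fox jumps over the lazy dog",
     "sentence2": "a fast dark-colored fox leaps over a sleepy canine",
     "label": PairLabel.not_equivalent, "idx": 1},
]

train_pairs = rows_to_columns(rows)
```

The resulting train_pairs has the same shape as the hand-written dictionary above, so it can be passed to Dataset.from_dict and then cast with the Features object.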