TypeError while creating a custom dataset with Hugging Face datasets

I am trying to generate a custom dataset with the following code:

from datasets import Dataset,ClassLabel,Value

features = ({
  "sentence1": Value("string"),  # String type for sentence1
  "sentence2": Value("string"),  # String type for sentence2
  "label": ClassLabel(names=["not_equivalent", "equivalent"]),  # ClassLabel definition
  "idx": Value("int32"),
})
custom_dataset = Dataset.from_dict(train_pairs)
custom_dataset = custom_dataset.cast(features)
custom_dataset

My train_pairs looks like this (one sample):

{'sentence1': "that 's far too tragic to merit such superficial treatment ", 
'sentence2': "that 's far too tragic to merit such superficial treatment ", 
**'label': <ClassLabel.not_equivalent: 0>**, 
'idx': 5}

/usr/local/lib/python3.10/dist-packages/pyarrow/lib.cpython-310-x86_64-linux-gnu.so in string.from_py.__pyx_convert_string_from_py_std__in_string()

**TypeError**: expected bytes, int found
So I changed the label to an integer, but I get the same error:

{'sentence1': "that 's far too tragic to merit such superficial treatment ", 
'sentence2': "that 's far too tragic to merit such superficial treatment ", 
**'label': 0**, 
'idx': 5}

/usr/local/lib/python3.10/dist-packages/pyarrow/lib.cpython-310-x86_64-linux-gnu.so in string.from_py.__pyx_convert_string_from_py_std__in_string()

**TypeError**: expected bytes, int found

I am trying to model my dataset as below. (data sample + feature info)

Solution

ClassLabel accepts integer labels directly, so there is no need to pre-process them into a special format, which it looks like you may have tried (<ClassLabel.not_equivalent: 0>).

Your features should be defined with the datasets.Features class rather than as a plain dict wrapped in parentheses:

from datasets import ClassLabel, Features, Value

features = Features({
    "sentence1": Value("string"),  # String type for sentence1
    "sentence2": Value("string"),  # String type for sentence2
    "label": ClassLabel(names=["not_equivalent", "equivalent"]),  # ClassLabel definition
    "idx": Value("int32"),
})

When you pass train_pairs to Dataset.from_dict, it should be a dictionary whose values are lists, one list per column; the entries at the same index across the lists together form one row. (I used ChatGPT to generate a few extra examples, but it also works with just one.)

train_pairs = {
    'sentence1': [
        "that 's far too tragic to merit such superficial treatment",
        "the quick brown fox jumps over the lazy dog",
        "I love machine learning and data science",
    ],
    'sentence2': [
        "that 's far too tragic to merit such superficial treatment",
        "a fast dark-colored fox leaps over a sleepy canine",
        "I love machine learning and data science",
    ],
    'label': [1, 0, 1],
    'idx': [0, 1, 2],
}
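
If your data is currently a list of per-example dictionaries like the sample in the question, it can be reshaped into this column layout first. This is only a sketch: rows and rows_to_columns are hypothetical names, and it assumes every row has the same keys and that the labels are already plain integers.

```python
from collections import defaultdict


def rows_to_columns(rows):
    """Turn a list of per-example dicts into a dict of equal-length lists."""
    columns = defaultdict(list)
    for row in rows:
        for key, value in row.items():
            columns[key].append(value)
    return dict(columns)


# Hypothetical input: one dict per example, as in the question's sample.
rows = [
    {"sentence1": "a", "sentence2": "a", "label": 1, "idx": 0},
    {"sentence1": "b", "sentence2": "c", "label": 0, "idx": 1},
]

train_pairs = rows_to_columns(rows)
print(train_pairs)
```

Recent versions of datasets also provide Dataset.from_list, which takes a list of row dicts directly, if your installed version has it.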

Running this gives me no errors, and it should give you a reproducible way to resolve your issue with the labels.