python huggingface-datasets

TypeError while creating a custom dataset with Hugging Face datasets


I am trying to generate a custom dataset with the following code:

from datasets import Dataset,ClassLabel,Value

features = ({
  "sentence1": Value("string"),  # String type for sentence1
  "sentence2": Value("string"),  # String type for sentence2
  "label": ClassLabel(names=["not_equivalent", "equivalent"]),  # ClassLabel definition
  "idx": Value("int32"),
})
custom_dataset = Dataset.from_dict(train_pairs)
custom_dataset = custom_dataset.cast(features)
custom_dataset

A sample entry from my train_pairs looks like this:

{'sentence1': "that 's far too tragic to merit such superficial treatment ", 
'sentence2': "that 's far too tragic to merit such superficial treatment ", 
'label': <ClassLabel.not_equivalent: 0>, 
'idx': 5}

/usr/local/lib/python3.10/dist-packages/pyarrow/lib.cpython-310-x86_64-linux-gnu.so in string.from_py.__pyx_convert_string_from_py_std__in_string()

TypeError: expected bytes, int found

So I changed the label to a plain integer:

{'sentence1': "that 's far too tragic to merit such superficial treatment ", 
'sentence2': "that 's far too tragic to merit such superficial treatment ", 
'label': 0, 
'idx': 5}

/usr/local/lib/python3.10/dist-packages/pyarrow/lib.cpython-310-x86_64-linux-gnu.so in string.from_py.__pyx_convert_string_from_py_std__in_string()

TypeError: expected bytes, int found

I am trying to model my dataset on the data sample and feature info shown below.

[Image: data sample and feature info]


Solution

  • ClassLabel accepts integer labels directly, so there is no need to pre-process them into a special format, which it looks like you may have tried (<ClassLabel.not_equivalent: 0>).

    Your features should be defined with the Features class from the datasets package, not as a plain dict:

    from datasets import Dataset, Features, ClassLabel, Value

    features = Features({
        "sentence1": Value("string"),  # String type for sentence1
        "sentence2": Value("string"),  # String type for sentence2
        "label": ClassLabel(names=["not_equivalent", "equivalent"]),  # ClassLabel definition
        "idx": Value("int32"),  # 32-bit integer row index
    })
    

    The train_pairs dictionary you pass to Dataset.from_dict should map each column name to a list of values, so that row i is made up of the i-th element of every list. (I used ChatGPT to generate a few extra examples, but it also works with a single row.)

    train_pairs = {
        'sentence1': [
            "that 's far too tragic to merit such superficial treatment",
            "the quick brown fox jumps over the lazy dog",
            "I love machine learning and data science",
        ],
        'sentence2': [
            "that 's far too tragic to merit such superficial treatment",
            "a fast dark-colored fox leaps over a sleepy canine",
            "I love machine learning and data science",
        ],
        'label': [1, 0, 1],
        'idx': [0, 1, 2],
    }
    

    Running this raises no errors for me, and it gives you a reproducible way to resolve the label issue.
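
If your data starts out as one dict per example, as the sample in the question suggests (this is an assumption; the question does not show the full structure of train_pairs), it first has to be reshaped into the column-oriented layout described above. Here is a minimal sketch of that step plus the dataset construction. The names rows, rows_to_columns, and build_dataset are hypothetical helpers for illustration and are not part of the datasets library. Passing features= to Dataset.from_dict is used here in place of a separate cast() call.

```python
def rows_to_columns(rows):
    """Turn a list of row dicts into a dict of equal-length column lists."""
    if not rows:
        return {}
    # Column names are taken from the first row; a later row that lacks
    # one of these keys raises KeyError instead of silently misaligning.
    keys = list(rows[0])
    return {key: [row[key] for row in rows] for key in keys}


def build_dataset(rows):
    """Build a Dataset with a ClassLabel column from row-oriented data."""
    # Imported here so rows_to_columns stays usable without the library.
    from datasets import ClassLabel, Dataset, Features, Value

    features = Features({
        "sentence1": Value("string"),
        "sentence2": Value("string"),
        "label": ClassLabel(names=["not_equivalent", "equivalent"]),
        "idx": Value("int32"),
    })
    # Labels are plain integers (0 or 1); ClassLabel maps them to names.
    return Dataset.from_dict(rows_to_columns(rows), features=features)


# Hypothetical row-oriented input with integer labels.
rows = [
    {
        "sentence1": "that 's far too tragic to merit such superficial treatment",
        "sentence2": "that 's far too tragic to merit such superficial treatment",
        "label": 1,
        "idx": 0,
    },
    {
        "sentence1": "the quick brown fox jumps over the lazy dog",
        "sentence2": "a fast dark-colored fox leaps over a sleepy canine",
        "label": 0,
        "idx": 1,
    },
]
```

Calling build_dataset(rows) should then give a Dataset whose label column is a ClassLabel, and ds.features["label"].int2str(0) maps an integer label back to its name.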