Tags: python, huggingface-transformers, huggingface-datasets

Cast features to ClassLabel


I have a dataset stored as a Python dictionary, which I converted to a Dataset:

import datasets

ds = datasets.Dataset.from_dict(bio_dict)

The resulting dataset looks like this:

Dataset({
    features: ['id', 'text', 'ner_tags', 'input_ids', 'attention_mask', 'label'],
    num_rows: 8805
})

When I use the train_test_split function of Datasets I receive the following error:

train_testvalid = ds.train_test_split(test_size=0.5, shuffle=True, stratify_by_column="label")

ValueError: Stratifying by column is only supported for ClassLabel column, and column label is Sequence.

How can I change the type to ClassLabel so that stratify works?


Solution

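  • Use class_encode_column, which casts the column to ClassLabel:

    ds = ds.class_encode_column("label")

    Note that class_encode_column only accepts a Value column (for example strings or integers). Since the error reports that label is a Sequence, each row's label likely needs to be reduced to a single value first.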