Tags: deep-learning, pytorch, pytorch-dataloader, huggingface-datasets

PyTorch: delete feature columns from a dataset


I have the dataset below and would like to delete features A through F. The dataset is converted from pandas DataFrames:

import datasets
from datasets import Dataset

dataset = datasets.DatasetDict({
    "train": Dataset.from_pandas(X_train),
    "test": Dataset.from_pandas(X_test),
    "val": Dataset.from_pandas(X_val),
})

Printing the dataset gives the following:

DatasetDict({
train: Dataset({
    features: ['A', 'B', 'C', 'D', 'E', 'F', 'text', '__index_level_0__', 'label'],
    num_rows: 1173
})
test: Dataset({
    features: ['A', 'B', 'C', 'D', 'E', 'F', 'text', '__index_level_0__', 'label'],
    num_rows: 1369
})
val: Dataset({
    features: ['A', 'B', 'C', 'D', 'E', 'F', 'text', '__index_level_0__', 'label'],
    num_rows: 1369
})

})

The result I want looks like this:

DatasetDict({
train: Dataset({
    features: ['text', '__index_level_0__', 'label'],
    num_rows: 1173
})
test: Dataset({
    features: ['text', '__index_level_0__', 'label'],
    num_rows: 1369
})
val: Dataset({
    features: ['text', '__index_level_0__', 'label'],
    num_rows: 1369
})

})


Solution

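  • What you need is the remove_columns() method from the datasets library. It is available on both Dataset and DatasetDict objects, so you can drop columns at this level instead of in pandas beforehand. It returns a new object, so assign the result back. For example, to remove a single column: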

    dataset = dataset.remove_columns("label")
    

    For your case, it would be:

    dataset = dataset.remove_columns(['A', 'B', 'C', 'D', 'E', 'F'])
    

    You can have a look here: https://huggingface.co/docs/datasets/process