pandas huggingface-datasets

Convert pandas DataFrame to DatasetDict


I cannot find how to convert a pandas DataFrame to the type datasets.dataset_dict.DatasetDict, for use in a BERT workflow with a Hugging Face model. Take these simple DataFrames, for example:

train_df = pd.DataFrame({
     "label" : [1, 2, 3],
     "text" : ["apple", "pear", "strawberry"]
})

test_df = pd.DataFrame({
     "label" : [2, 2, 1],
     "text" : ["banana", "pear", "apple"]
})

What is the most efficient way to convert these to the type above?


Solution

  • One possibility is to first create two Dataset objects and then combine them in a DatasetDict:

    import datasets
    import pandas as pd
    
    
    train_df = pd.DataFrame({
         "label" : [1, 2, 3],
         "text" : ["apple", "pear", "strawberry"]
    })
    
    test_df = pd.DataFrame({
         "label" : [2, 2, 1],
         "text" : ["banana", "pear", "apple"]
    })
    
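    # Build one Dataset per split directly from each DataFrame
    train_dataset = datasets.Dataset.from_pandas(train_df)
    test_dataset = datasets.Dataset.from_pandas(test_df)
    # Collect the splits under their names
    my_dataset_dict = datasets.DatasetDict({"train": train_dataset, "test": test_dataset})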
    

    The result is:

    DatasetDict({
        train: Dataset({
            features: ['label', 'text'],
            num_rows: 3
        })
        test: Dataset({
            features: ['label', 'text'],
            num_rows: 3
        })
    })