I cannot find anywhere how to convert a pandas dataframe to type datasets.dataset_dict.DatasetDict
, for optimal use in a BERT workflow with a huggingface model. Take these simple dataframes, for example.
train_df = pd.DataFrame({
"label" : [1, 2, 3],
"text" : ["apple", "pear", "strawberry"]
})
test_df = pd.DataFrame({
"label" : [2, 2, 1],
"text" : ["banana", "pear", "apple"]
})
What is the most efficient way to convert these to the type above?
One possibility is to first create two Datasets and then join them:
import datasets
import pandas as pd
train_df = pd.DataFrame({
"label" : [1, 2, 3],
"text" : ["apple", "pear", "strawberry"]
})
test_df = pd.DataFrame({
"label" : [2, 2, 1],
"text" : ["banana", "pear", "apple"]
})
train_dataset = Dataset.from_dict(train_df)
test_dataset = Dataset.from_dict(test_df)
my_dataset_dict = datasets.DatasetDict({"train":train_dataset,"test":test_dataset})
The result is:
DatasetDict({
train: Dataset({
features: ['label', 'text'],
num_rows: 3
})
test: Dataset({
features: ['label', 'text'],
num_rows: 3
})
})