I have 4 JSON files with the same structure and the same amount of data (but different contents). I uploaded the 4 files to my HuggingFace dataset repository.
My first try
I uploaded the 4 files to the repository's root directory. As a result, HuggingFace combined the 4 files into 1 dataset:
DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 551964
    })
})
My second try
I renamed the 4 files to alpha.json, beta.json, delta.json, gamma.json. The result was the same.
My third try
I put the 4 files into 4 separate folders. The result was still the same.
According to the official documentation, the loader only recognizes certain file and folder naming patterns.
My goal is to load my dataset like this, with 4 custom splits:
ds = load_dataset("myusername/my-dataset")
print(ds)
and the output is:
DatasetDict({
    alpha: Dataset({   # loads data-1.json
        features: ['translation'],
        num_rows: 137991
    }),
    beta: Dataset({    # loads data-2.json
        features: ['translation'],
        num_rows: 137991
    }),
    delta: Dataset({   # loads data-3.json
        features: ['translation'],
        num_rows: 137991
    }),
    gamma: Dataset({   # loads data-4.json
        features: ['translation'],
        num_rows: 137991
    })
})
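One workaround that avoids multiple repositories is to pass an explicit data_files mapping to load_dataset: each dictionary key becomes a split name. This is a minimal sketch, assuming the local files are named data-1.json through data-4.json (substitute your actual filenames):

```python
# Hypothetical split-to-file mapping; every key becomes a split
# in the resulting DatasetDict.
data_files = {
    "alpha": "data-1.json",
    "beta": "data-2.json",
    "delta": "data-3.json",
    "gamma": "data-4.json",
}

# from datasets import load_dataset
# ds = load_dataset("json", data_files=data_files)
# print(ds)  # DatasetDict with splits alpha, beta, delta, gamma
```

The downside is that every consumer of the dataset must repeat this mapping, instead of getting the splits automatically from a bare load_dataset("myusername/my-dataset") call.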
The only clumsy workaround I can think of is to create 4 separate dataset repositories, which is hard to manage.
I just found out that I need a special folder and file naming pattern to achieve my goal:
my_repository/
├── README.md
└── data/
    ├── alpha-00000-of-00001.csv
    ├── beta-00000-of-00001.csv
    ├── delta-00000-of-00001.csv
    └── gamma-00000-of-00001.csv
which load_dataset() will load as:
DatasetDict({
    alpha: Dataset({
        features: ['translation'],
        num_rows: 137991
    }),
    beta: Dataset({
        features: ['translation'],
        num_rows: 137991
    }),
    delta: Dataset({
        features: ['translation'],
        num_rows: 137991
    }),
    gamma: Dataset({
        features: ['translation'],
        num_rows: 137991
    })
})
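Migrating an existing repository to that layout can be scripted locally before re-uploading. This is a sketch under two assumptions: the originals are named data-1.json through data-4.json, and the same split-name pattern (with the -00000-of-00001 single-shard suffix) is matched for JSON files as it is for CSV:

```python
from pathlib import Path

# Assumed mapping from original filenames to the desired split names.
SPLITS = {
    "data-1.json": "alpha",
    "data-2.json": "beta",
    "data-3.json": "delta",
    "data-4.json": "gamma",
}

def rename_for_splits(repo_dir: str) -> list[Path]:
    """Move each file into data/ under the <split>-00000-of-00001 pattern."""
    repo = Path(repo_dir)
    data_dir = repo / "data"
    data_dir.mkdir(exist_ok=True)
    renamed = []
    for old_name, split in SPLITS.items():
        src = repo / old_name
        dst = data_dir / f"{split}-00000-of-00001.json"
        src.rename(dst)
        renamed.append(dst)
    return renamed
```

After running this in a local clone of the repository and pushing the result, load_dataset should discover the four splits by name.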