Tags: dataset, huggingface, huggingface-datasets

Is there a way to download only a part of a dataset from Hugging Face?


I'm trying to load the People's Speech dataset, but it's way too big. Is there a way to download only a part of it?

from datasets import load_dataset

train = load_dataset("MLCommons/peoples_speech", "clean", split="train[:10%]")
test = load_dataset("MLCommons/peoples_speech", "clean", split="test[:10%]")

Using split="train[:10%]" didn't help; it still tries to download the entire dataset...


Solution

  • Did you consider using the streaming feature in Datasets? It lets you stream data from the Hugging Face Hub without having to download the dataset locally.

    from datasets import load_dataset
    from torch.utils.data import DataLoader
    # you get a dict of {"split": IterableDataset}
    dataset = load_dataset("MLCommons/peoples_speech", "clean", streaming=True)
    # your preprocessing and filtering
    ...
    train_dataloader = DataLoader(dataset["train"], batch_size=4)
    valid_dataloader = DataLoader(dataset["validation"], batch_size=4)
    train_steps_per_epoch = 500
    # training loop
    for n in range(5):
        for i, batch in enumerate(train_dataloader):
            # if you only want to do a limited amount of optimization steps per epoch
            if i == train_steps_per_epoch:
                break
            # train step
            ...
    

    I hope it helps.