Tags: dataset, huggingface, huggingface-datasets

Is there a way to download only a part of a dataset from Hugging Face?


I'm trying to load the People's Speech dataset, but it's way too big. Is there a way to download only a part of it?

from datasets import load_dataset

train = load_dataset("MLCommons/peoples_speech", "clean", split="train[:10%]")
test = load_dataset("MLCommons/peoples_speech", "clean", split="test[:10%]")

Using split="train[:10%]" didn't help; it still tries to download the entire dataset...


Solution

  • Did you consider using the streaming feature in Datasets? It lets you stream data from the Hugging Face Hub without having to download the dataset locally.

    from datasets import load_dataset
    from torch.utils.data import DataLoader
    # you get a dict of {"split": IterableDataset}
    dataset = load_dataset("MLCommons/peoples_speech", "clean", streaming=True)
    # your preprocessing and filtering
    ...
    train_dataloader = DataLoader(dataset["train"], batch_size=4)
    valid_dataloader = DataLoader(dataset["validation"], batch_size=4)
    train_steps_per_epoch = 500
    # training loop
    for n in range(5):
        for i, batch in enumerate(train_dataloader):
            # if you only want to do a limited amount of optimization steps per epoch
            if i == train_steps_per_epoch:
                break
            # train step
            ...
    

    I hope it helps.