Tags: pytorch, huggingface-transformers, huggingface-datasets

How to load a portion of a file as a huggingface dataset?


I have a very large json file that I'm trying to load as a dataset:

from datasets import load_dataset
dataset = load_dataset("json", data_files="data/my_dataset.json")

The issue is that the file is too large, and I run out of memory when trying to load it. Is there a way to load only a portion of it? Say, the first 1000 rows?


Solution

  • You can use streaming (see https://huggingface.co/docs/datasets/stream) so the file is read lazily instead of being loaded into memory all at once. Note that passing only data_files returns a dict keyed by split name, so add split="train" to get an iterable dataset you can read from directly:

    from datasets import load_dataset

    # streaming=True reads the file lazily; split="train" returns the
    # IterableDataset itself rather than a dict keyed by split name.
    dataset = load_dataset("json", data_files="data/my_dataset.json",
                           streaming=True, split="train")

    # Read the first example without loading the whole file.
    print(next(iter(dataset)))
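
  • To get the "first 1000 rows" specifically: an IterableDataset has a take() method that lazily yields only the first n examples, so the rest of the file is never read. A minimal sketch, assuming the same file path as above:

    from datasets import load_dataset

    # Stream the file; split="train" gives an IterableDataset directly.
    dataset = load_dataset("json", data_files="data/my_dataset.json",
                           streaming=True, split="train")

    # take(1000) returns a new IterableDataset limited to the first 1000 rows.
    first_1000 = dataset.take(1000)

    for example in first_1000:
        print(example)

    If you need a regular in-memory Dataset afterwards, you can materialize just that slice, e.g. with Dataset.from_list(list(first_1000)).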