Tags: pytorch, huggingface-transformers, huggingface-datasets

How to load a portion of a file as a huggingface dataset?


I have a very large json file that I'm trying to load as a dataset:

from datasets import load_dataset
dataset = load_dataset("json", data_files="data/my_dataset.json")

The issue is that the file is too large, and I run out of memory when trying to load it. Is there a way to load only a portion of it? Say, the first 1000 rows?


Solution

  • You can use streaming (see https://huggingface.co/docs/datasets/stream) so the file is read lazily instead of being loaded into memory all at once. Note that passing only data_files returns a dict keyed by split name, so add split="train" to get an iterable dataset you can read from directly:

    from datasets import load_dataset

    # streaming=True reads the file lazily; split="train" returns the
    # IterableDataset itself rather than a dict keyed by split name.
    dataset = load_dataset("json", data_files="data/my_dataset.json",
                           streaming=True, split="train")

    # Read the first example without loading the whole file.
    print(next(iter(dataset)))
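
  • To get the "first 1000 rows" specifically: an IterableDataset has a take() method that lazily yields only the first n examples, so the rest of the file is never read. A minimal sketch, assuming the same file path as above:

    from datasets import load_dataset

    # Stream the file; split="train" gives an IterableDataset directly.
    dataset = load_dataset("json", data_files="data/my_dataset.json",
                           streaming=True, split="train")

    # take(1000) returns a new IterableDataset limited to the first 1000 rows.
    first_1000 = dataset.take(1000)

    for example in first_1000:
        print(example)

    If you need a regular in-memory Dataset afterwards, you can materialize just that slice, e.g. with Dataset.from_list(list(first_1000)).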