I have a very large JSON file that I'm trying to load as a dataset:
from datasets import load_dataset
dataset = load_dataset("json", data_files="data/my_dataset.json")
The file is too large, and I run out of memory when I try to load it. Is there a way to load only a portion of it, say the first 1000 rows?
You can use streaming mode, which reads examples lazily as you iterate instead of loading the whole file into memory: https://huggingface.co/docs/datasets/stream
from datasets import load_dataset
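# split="train" returns the stream of rows itself; without it you get a dict
# of splits, and iterating that yields split names, not examples.
dataset = load_dataset("json", data_files="data/my_dataset.json", split="train", streaming=True)
# Read the first example.
print(next(iter(dataset)))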
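To get the first 1000 rows asked about in the question, one option is to cut the stream off with `itertools.islice`. This is a minimal sketch, not the only way to do it: it assumes the file can be streamed and that the split is named "train" (the usual default when `data_files` is a single path). `first_rows` and `load_first_rows` are hypothetical helper names introduced here for illustration.

```python
from itertools import islice


def first_rows(rows, n=1000):
    """Return the first n items of any iterable as a list.

    islice stops pulling from `rows` after n items, so the rest of the
    stream is never read.
    """
    return list(islice(rows, n))


def load_first_rows(path, n=1000):
    """Stream a JSON file with Hugging Face datasets and keep only n rows.

    Assumes the split is named "train".
    """
    # Imported here so first_rows stays usable without the library installed.
    from datasets import load_dataset

    stream = load_dataset("json", data_files=path, split="train", streaming=True)
    return first_rows(stream, n)
```

Calling `load_first_rows("data/my_dataset.json")` should give a plain list of up to 1000 row dicts. If you would rather stay with a lazy dataset object, the streamed dataset also has a `take(n)` method that returns a truncated stream.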