python, huggingface, huggingface-datasets, huggingface-hub

How to load a huggingface dataset from local path?


Take a simple example from this website, https://huggingface.co/datasets/Dahoas/rm-static:

If I want to load this dataset online, I simply use:

from datasets import load_dataset
dataset = load_dataset("Dahoas/rm-static") 

What if I want to load the dataset from a local path? I first downloaded the files, keeping the same folder structure as the "Files and versions" tab on the web page:

-data
|-test-00000-of-00001-bf4c733542e35fcb.parquet
|-train-00000-of-00001-2a1df75c6bce91ab.parquet
-.gitattributes
-README.md
-dataset_infos.json

Then I put them into my folder, but loading raises an error:

dataset_path = "/data/coco/dataset/Dahoas/rm-static"
tmp_dataset = load_dataset(dataset_path)

It raises FileNotFoundError: No (supported) data files or dataset script found in /data/coco/dataset/Dahoas/rm-static.


Solution

  • Save the data with save_to_disk then load it with load_from_disk. For example:

    import datasets
    ds = datasets.load_dataset("Dahoas/rm-static") 
    ds.save_to_disk("Path/to/save")
    

    Later, if you want to reuse it, load it back with load_from_disk:

    ds = datasets.load_from_disk("Path/to/save")
    

    You can verify this by printing the dataset; you will get the same result from both. This is the easier way. The data is generally saved in Arrow format.

    The second method, loading the downloaded parquet files directly, requires you to explicitly declare the dataset format and its configuration (which may be described in the JSON file); then you can load it.
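
A minimal sketch of the second method, assuming the folder layout from the question (parquet files under data/, each named with its split first, e.g. train-00000-of-...). The helper names build_data_files and load_local_parquet are hypothetical, not part of the datasets library; the "parquet" builder name and the data_files argument of load_dataset are the library's own.

```python
from pathlib import Path


def build_data_files(dataset_dir):
    """Map split names to local parquet files, e.g. {"train": [...], "test": [...]}."""
    data_files = {}
    for path in sorted(Path(dataset_dir).glob("data/*.parquet")):
        # Assumption: file names look like "train-00000-of-00001-<hash>.parquet",
        # so the text before the first "-" is taken as the split name.
        split = path.name.split("-", 1)[0]
        data_files.setdefault(split, []).append(str(path))
    return data_files


def load_local_parquet(dataset_dir):
    # Imported here so build_data_files can be used without `datasets` installed.
    from datasets import load_dataset

    # "parquet" selects the generic parquet loader; data_files tells it
    # which local files belong to which split.
    return load_dataset("parquet", data_files=build_data_files(dataset_dir))
```

With the files from the question in place, load_local_parquet("/data/coco/dataset/Dahoas/rm-static") should return a DatasetDict with train and test splits. Naming the format and the files explicitly avoids relying on the library to detect them from the folder.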