Tags: filenotfoundexception, huggingface, huggingface-datasets

HuggingFace load_dataset error (.incomplete/parquet-validation-00000-00000-of-NNNNN.arrow')


I'm following a tutorial to fine-tune a model, but I'm stuck on a load_dataset error I can't solve. For context, the tutorial first uploaded this dataset to HF, and I managed to upload an identical one.

The problem appears when I run a script to download the dataset. If I download the original dataset, the process goes well and all files are fetched correctly. But when I try downloading mine, only part of the files seems to be created: everything up to the 0.0.0 folder you'll see in the error message, but nothing after that.

The command I'm running is dataset = load_dataset("FelipeBandeiraPoatek/invoices-donut-data-v2", split="train"), and the error log I'm getting is the following:

Downloading data files: 100%|████████████████████████████████████████████████| 3/3 [00:00<?, ?it/s]
Extracting data files: 100%|████████████████████████████████████████| 3/3 [00:00<00:00, 198.67it/s] 
Traceback (most recent call last):
  File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\builder.py", line 1852, in _prepare_split_single
    writer = writer_class(
  File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\arrow_writer.py", line 334, in __init__
    self.stream = self._fs.open(fs_token_paths[2][0], "wb")
  File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\fsspec\spec.py", line 1241, in open
    f = self._open(
  File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\fsspec\implementations\local.py", line 184, in _open
    return LocalFileOpener(path, mode, fs=self, **kwargs)
  File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\fsspec\implementations\local.py", line 315, in __init__
    self._open()
  File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\fsspec\implementations\local.py", line 320, in _open
    self.f = open(self.path, mode=self.mode)
FileNotFoundError: [Errno 2] No such file or directory: 'C:/Users/Felipe Bandeira/.cache/huggingface/datasets/FelipeBandeiraPoatek___parquet/FelipeBandeiraPoatek--invoices-donut-data-v2-ca49e83826870faf/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec.incomplete/parquet-validation-00000-00000-of-NNNNN.arrow'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\Felipe Bandeira\Desktop\sparrow\sparrow-data\run_donut_test.py", line 11, in <module>
    main()
  File "c:\Users\Felipe Bandeira\Desktop\sparrow\sparrow-data\run_donut_test.py", line 7, in main   
    dataset_tester.test("FelipeBandeiraPoatek/invoices-donut-data-v2")
  File "c:\Users\Felipe Bandeira\Desktop\sparrow\sparrow-data\tools\donut\dataset_tester.py", line 10, in test
    dataset = load_dataset(dataset_name, split="train")
  File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\load.py", line 1782, in load_dataset
    builder_instance.download_and_prepare(
  File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\builder.py", line 872, in download_and_prepare
    self._download_and_prepare(
  File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\builder.py", line 967, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\builder.py", line 1749, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\builder.py", line 1892, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

I haven't found any solutions for this and cannot figure out why the original dataset downloads fine, but mine (which is identical) does not. Any clues?

(I have tried:

  1. inspecting the functions that download the dataset
  2. inspecting the error logs
  3. deleting the folders that store the downloads on my PC and repeating the process
  4. cloning the files from the original repo into a repo of my own

In every case, I can download the dataset correctly from the original repo, but not from my own. The same error keeps happening.)


Solution

  • I ran into this problem too. In the end I found that the file name was too long and exceeded the system's naming length limit. Changing it to a shorter name fixed it.
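
To check whether this applies to the path in the traceback, here is a minimal sketch. It assumes the classic Windows limit of 260 characters per path (systems with long-path support enabled may behave differently). The helper names and the short cache folder "C:/hf" are hypothetical, chosen only for illustration. cache_dir is a load_dataset parameter that sets where the prepared files are written.

```python
# Assumption: the classic Win32 MAX_PATH limit of 260 characters (including the
# terminating null). Systems with long-path support enabled may not hit it.
WINDOWS_MAX_PATH = 260


def exceeds_path_limit(path: str, limit: int = WINDOWS_MAX_PATH) -> bool:
    """Return True if the path is too long to open under the assumed limit."""
    return len(path) >= limit


# The path copied from the FileNotFoundError in the question.
failing_path = (
    "C:/Users/Felipe Bandeira/.cache/huggingface/datasets/"
    "FelipeBandeiraPoatek___parquet/"
    "FelipeBandeiraPoatek--invoices-donut-data-v2-ca49e83826870faf/0.0.0/"
    "2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec.incomplete/"
    "parquet-validation-00000-00000-of-NNNNN.arrow"
)


def load_with_short_cache(dataset_name: str, cache_dir: str = "C:/hf"):
    """Load the train split using a shorter cache root.

    "C:/hf" is a hypothetical folder; any short, writable directory would do.
    """
    # Imported lazily so the length check above runs without the library installed.
    from datasets import load_dataset

    return load_dataset(dataset_name, split="train", cache_dir=cache_dir)


if __name__ == "__main__":
    print(len(failing_path), exceeds_path_limit(failing_path))
```

If the check reports that the path is too long, calling load_with_short_cache("FelipeBandeiraPoatek/invoices-donut-data-v2") or renaming the dataset repo to something shorter should bring the path under the limit. This would also explain why an identical dataset under a shorter repo name downloads without errors.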