I'm following a tutorial to fine-tune a model, but I've been stuck on a load_dataset error I can't solve. For context, the tutorial first uploaded its dataset to HF, and I managed to upload an identical one.
The problem appears when I run a script to download the dataset. If I download the original dataset, the process goes well and all files are fetched correctly. But when I try downloading mine, only part of the files seem to be downloaded (up to the 0.0.0 folder you'll see in the error message, but nothing after that).
The command I'm running is dataset = load_dataset("FelipeBandeiraPoatek/invoices-donut-data-v2", split="train"), and the error log I'm getting is the following:
Downloading data files: 100%|████████████████████████████████████████████████| 3/3 [00:00<?, ?it/s]
Extracting data files: 100%|████████████████████████████████████████| 3/3 [00:00<00:00, 198.67it/s]
Traceback (most recent call last):
File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\builder.py", line 1852, in _prepare_split_single
writer = writer_class(
File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\arrow_writer.py", line 334, in __init__
self.stream = self._fs.open(fs_token_paths[2][0], "wb")
File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\fsspec\spec.py", line 1241, in open
f = self._open(
File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\fsspec\implementations\local.py", line 184, in _open
return LocalFileOpener(path, mode, fs=self, **kwargs)
File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\fsspec\implementations\local.py", line 315, in __init__
self._open()
File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\fsspec\implementations\local.py", line 320, in _open
self.f = open(self.path, mode=self.mode)
FileNotFoundError: [Errno 2] No such file or directory: 'C:/Users/Felipe Bandeira/.cache/huggingface/datasets/FelipeBandeiraPoatek___parquet/FelipeBandeiraPoatek--invoices-donut-data-v2-ca49e83826870faf/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec.incomplete/parquet-validation-00000-00000-of-NNNNN.arrow'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "c:\Users\Felipe Bandeira\Desktop\sparrow\sparrow-data\run_donut_test.py", line 11, in <module>
main()
File "c:\Users\Felipe Bandeira\Desktop\sparrow\sparrow-data\run_donut_test.py", line 7, in main
dataset_tester.test("FelipeBandeiraPoatek/invoices-donut-data-v2")
File "c:\Users\Felipe Bandeira\Desktop\sparrow\sparrow-data\tools\donut\dataset_tester.py", line 10, in test
dataset = load_dataset(dataset_name, split="train")
File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\load.py", line 1782, in load_dataset
builder_instance.download_and_prepare(
File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\builder.py", line 872, in download_and_prepare
self._download_and_prepare(
File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\builder.py", line 967, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\builder.py", line 1749, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\builder.py", line 1892, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
I haven't found any solutions for this and cannot figure out why the original dataset downloads fine while mine (which is identical) does not. Any clues?
(I have tried several things; in all of the cases, I can download the dataset correctly from the original repo, but not from my own. The same error keeps happening.)
I also encountered this problem. In the end I found that the file name was too long, exceeding the system's path-length limit; changing the file name to a shorter one made it work!
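A quick way to check this hypothesis is to measure the path from the traceback against Windows' default 260-character MAX_PATH limit (the path below is copied verbatim from the error message). The cache_dir workaround in the comment is a sketch of one possible fix, not something confirmed by the original poster; cache_dir is a standard load_dataset parameter that relocates the generated files to a shorter base path:

```python
# Path copied verbatim from the FileNotFoundError in the traceback.
failing_path = (
    "C:/Users/Felipe Bandeira/.cache/huggingface/datasets/"
    "FelipeBandeiraPoatek___parquet/"
    "FelipeBandeiraPoatek--invoices-donut-data-v2-ca49e83826870faf/0.0.0/"
    "2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec"
    ".incomplete/parquet-validation-00000-00000-of-NNNNN.arrow"
)

# Windows' default MAX_PATH limit is 260 characters.
print(len(failing_path), len(failing_path) > 260)

# One possible workaround (untested sketch): point the datasets cache at a
# short directory so the generated file paths stay under the limit, e.g.:
#   dataset = load_dataset(
#       "FelipeBandeiraPoatek/invoices-donut-data-v2",
#       split="train",
#       cache_dir="C:/hf",  # short base path for the cache
#   )
```

Alternatively, long paths can be enabled system-wide on Windows 10+ via the LongPathsEnabled registry setting, which avoids renaming anything.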