parquet, huggingface, huggingface-datasets, file-structure, huggingface-hub

Why do I get an exception when attempting automatic processing by the Hugging Face parquet-converter?


What file structure should I use on the Hugging Face Hub if I have a /train.zip archive of PNG image files and a /metadata.csv file with annotations for them, so that the parquet-converter bot automatically recognizes and correctly interprets the dataset?

An example of the desired result


But regardless of which file structure I use,

https://huggingface.co/datasets/james-r/so-invalid-image-archive-with-metadata-1

/train.zip
/metadata.csv

or

/train/train.zip
/metadata.csv

I get an exception:

Cannot load the dataset split (in streaming mode) to extract the first rows.
Error code:   StreamingRowsError
Exception:    ValueError
Message:      One or several metadata.csv were found, but not in the same directory or in a parent directory of zip://1.png::hf://datasets/[user]/[repo-name]@[hash]/train/train.zip.
Traceback:    Traceback (most recent call last):
                File "/src/services/worker/src/worker/job_runners/split/first_rows.py", line 322, in compute
                  compute_first_rows_from_parquet_response(
                File "/src/services/worker/src/worker/job_runners/split/first_rows.py", line 88, in compute_first_rows_from_parquet_response
                  rows_index = indexer.get_rows_index(
                File "/src/libs/libcommon/src/libcommon/parquet_utils.py", line 640, in get_rows_index
                  return RowsIndex(
                File "/src/libs/libcommon/src/libcommon/parquet_utils.py", line 521, in __init__
                  self.parquet_index = self._init_parquet_index(
                File "/src/libs/libcommon/src/libcommon/parquet_utils.py", line 538, in _init_parquet_index
                  response = get_previous_step_or_raise(
                File "/src/libs/libcommon/src/libcommon/simple_cache.py", line 590, in get_previous_step_or_raise
                  raise CachedArtifactError(
              libcommon.simple_cache.CachedArtifactError: The previous step failed.
              
              During handling of the above exception, another exception occurred:
              
              Traceback (most recent call last):
                File "/src/services/worker/src/worker/utils.py", line 96, in get_rows_or_raise
                  return get_rows(
                File "/src/libs/libcommon/src/libcommon/utils.py", line 197, in decorator
                  return func(*args, **kwargs)
                File "/src/services/worker/src/worker/utils.py", line 73, in get_rows
                  rows_plus_one = list(itertools.islice(ds, rows_max_number + 1))
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/iterable_dataset.py", line 1389, in __iter__
                  for key, example in ex_iterable:
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/iterable_dataset.py", line 234, in __iter__
                  yield from self.generate_examples_fn(**self.kwargs)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/packaged_modules/folder_based_builder/folder_based_builder.py", line 376, in _generate_examples
                  raise ValueError(
              ValueError: One or several metadata.csv were found, but not in the same directory or in a parent directory of zip://1.png::hf://datasets/[user]/[repo-name]@[hash]/train/train.zip.

What am I doing wrong?


Solution

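  • This appears to be an issue in the datasets package: judging by the traceback, its folder-based builder does not match metadata.csv to the images inside the ZIP archive.

    A workaround is to convert the metadata.csv file to the metadata.jsonl format.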

    Here is an example of the recognized file structure.
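
The CSV-to-JSONL conversion can be sketched with the Python standard library alone. This is a minimal sketch, not the answerer's own script: it assumes metadata.csv has a header row (for image folders the datasets loader expects a file_name column), and the helper name csv_to_jsonl is made up for illustration. Note that every value is written as a string, because csv.DictReader does not infer types.

```python
import csv
import json
from pathlib import Path


def csv_to_jsonl(csv_path, jsonl_path):
    """Write each row of a headered CSV file as one JSON object per line.

    Hypothetical helper for illustration; returns the number of rows written.
    """
    count = 0
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(jsonl_path, "w", encoding="utf-8") as dst:
        # DictReader uses the header row as keys, so each row becomes a dict.
        for row in csv.DictReader(src):
            dst.write(json.dumps(row, ensure_ascii=False) + "\n")
            count += 1
    return count


if __name__ == "__main__":
    # Assumption: the script runs in the directory that holds metadata.csv.
    if Path("metadata.csv").exists():
        written = csv_to_jsonl("metadata.csv", "metadata.jsonl")
        print(f"wrote {written} rows to metadata.jsonl")
```

After the conversion, upload metadata.jsonl in place of metadata.csv so that only one metadata file remains in the repository.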