Tags: dataset, huggingface, huggingface-datasets

Dataset library DatasetGenerationError


This is the strangest error I've encountered. I copied the following straight from the Hugging Face website to start learning about audio classifiers:

from datasets import load_dataset, Audio, Dataset

minds = load_dataset("PolyAI/minds14", name="en-US", split="train")

Running it raises the following error: datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

I've tried using Dataset.cleanup_cache_files but that did not help. Why is this error so vague? Any ideas on how to resolve this?

In case it may help, here's the full traceback:

Generating train split: 0 examples [00:00, ? examples/s]
Traceback (most recent call last):
  File "C:\Users\Brandon\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\features\audio.py", line 91, in encode_example
    import soundfile as sf  # soundfile is a dependency of librosa, needed to decode audio files.
ModuleNotFoundError: No module named 'soundfile'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\Brandon\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\builder.py", line 1693, in _prepare_split_single
    example = self.info.features.encode_example(record) if self.info.features is not None else record
  File "C:\Users\Brandon\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\features\features.py", line 1852, in encode_example
    return encode_nested_example(self, example)
  File "C:\Users\Brandon\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\features\features.py", line 1229, in encode_nested_example
    {
  File "C:\Users\Brandon\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\features\features.py", line 1230, in <dictcomp>
    k: encode_nested_example(sub_schema, sub_obj, level=level + 1)
  File "C:\Users\Brandon\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\features\features.py", line 1284, in encode_nested_example
    return schema.encode_example(obj) if obj is not None else None
  File "C:\Users\Brandon\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\features\audio.py", line 93, in encode_example
    raise ImportError("To support encoding audio data, please install 'soundfile'.") from err
ImportError: To support encoding audio data, please install 'soundfile'.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\Brandon\Documents\00 School Files 00\University\LLM Research\UAC\uac.py", line 5, in <module>
    minds = load_dataset("PolyAI/minds14", name="en-US", split="train")
  File "C:\Users\Brandon\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\load.py", line 2153, in load_dataset
    builder_instance.download_and_prepare(
  File "C:\Users\Brandon\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\builder.py", line 954, in download_and_prepare
    self._download_and_prepare(
  File "C:\Users\Brandon\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\builder.py", line 1717, in _download_and_prepare
    super()._download_and_prepare(
  File "C:\Users\Brandon\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\builder.py", line 1049, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "C:\Users\Brandon\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\builder.py", line 1555, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "C:\Users\Brandon\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\builder.py", line 1712, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

Solution

  • TL;DR

    Just install soundfile

    pip install soundfile
    

    The underlying error is in the stack trace, but it's unfortunately a little difficult to read. The exceptions are chained, so the root cause is in the first traceback, not the last one:

    Traceback (most recent call last):
      File "C:\Users\Brandon\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\features\audio.py", line 91, in encode_example
        import soundfile as sf  # soundfile is a dependency of librosa, needed to decode audio files.
    ModuleNotFoundError: No module named 'soundfile'
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "C:\Users\Brandon\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\builder.py", line 1693, in _prepare_split_single
        example = self.info.features.encode_example(record) if self.info.features is not None else record
      File "C:\Users\Brandon\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\features\features.py", line 1852, in encode_example
        return encode_nested_example(self, example)
      File "C:\Users\Brandon\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\features\features.py", line 1229, in encode_nested_example
        {
      File "C:\Users\Brandon\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\features\features.py", line 1230, in <dictcomp>
        k: encode_nested_example(sub_schema, sub_obj, level=level + 1)
      File "C:\Users\Brandon\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\features\features.py", line 1284, in encode_nested_example
        return schema.encode_example(obj) if obj is not None else None
      File "C:\Users\Brandon\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\features\audio.py", line 93, in encode_example
        raise ImportError("To support encoding audio data, please install 'soundfile'.") from err
    ImportError: To support encoding audio data, please install 'soundfile'.
    

    It's complaining that the Python library soundfile is missing from your environment.
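
If you run into another vague wrapper error like this, one option is to walk the exception chain in code instead of scrolling through the traceback. The sketch below is only an illustration: `root_cause` and `simulate_chain` are hypothetical helpers (not part of the datasets library), and the simulated chain uses a plain `RuntimeError` as a stand-in for `DatasetGenerationError` so that it runs without datasets installed. It relies on the standard `__cause__` attribute that Python sets for `raise ... from ...`.

```python
import importlib.util


def root_cause(exc: BaseException) -> BaseException:
    """Follow the `raise ... from ...` chain back to the first exception."""
    seen = set()  # guard against a (very unlikely) cyclic chain
    while exc.__cause__ is not None and id(exc) not in seen:
        seen.add(id(exc))
        exc = exc.__cause__
    return exc


def simulate_chain() -> None:
    """Hypothetical stand-in that mimics the three-level chain in the traceback."""
    try:
        try:
            raise ModuleNotFoundError("No module named 'soundfile'")
        except ModuleNotFoundError as err:
            raise ImportError(
                "To support encoding audio data, please install 'soundfile'."
            ) from err
    except ImportError as e:
        # RuntimeError is an assumption standing in for DatasetGenerationError.
        raise RuntimeError("An error occurred while generating the dataset") from e


try:
    simulate_chain()
except RuntimeError as top:
    cause = root_cause(top)
    print(type(cause).__name__, "-", cause)

# Check whether soundfile is importable in the current environment.
print("soundfile installed:", importlib.util.find_spec("soundfile") is not None)
```

In your own script you would wrap the `load_dataset(...)` call in the `try` block instead of `simulate_chain()`. The last line is a quick way to confirm that `pip install soundfile` went into the same Python environment that runs your script.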