python, google-colaboratory, huggingface-transformers, huggingface-datasets

How to load a dataset in streaming mode on Google Colab?


I am trying to save disk space while using the CommonVoice French dataset (19 GB) on Google Colab, because my notebook keeps crashing when it runs out of disk space. According to the HuggingFace documentation, a dataset can be loaded in streaming mode so that we can iterate over it directly without downloading the entire dataset. I tried to use that mode in Google Colab but can't make it work, and I haven't found anything on SO about this issue.

!pip install datasets
!pip install 'datasets[streaming]'
!pip install aiohttp

common_voice_train = load_dataset("common_voice", "fr", split="train", streaming=True)

Then, I get the following error:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-24-489f8a0ca4e4> in <module>()
----> 1 common_voice_train = load_dataset("common_voice", "fr", split="train", streaming=True)

/usr/local/lib/python3.7/dist-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, script_version, use_auth_token, task, streaming, **config_kwargs)
    811         if not config.AIOHTTP_AVAILABLE:
    812             raise ImportError(
--> 813                 f"To be able to use dataset streaming, you need to install dependencies like aiohttp "
    814                 f'using "pip install \'datasets[streaming]\'" or "pip install aiohttp" for instance'
    815             )

ImportError: To be able to use dataset streaming, you need to install dependencies like aiohttp using "pip install 'datasets[streaming]'" or "pip install aiohttp" for instance

---------------------------------------------------------------------------
NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below.
---------------------------------------------------------------------------

Is there a reason why Google Colab wouldn't allow loading a dataset in streaming mode?

Otherwise, what am I missing?


Solution

  • Writing this up as an answer for future reference. Based on @kkgarg's comment, it seems that streaming is not yet implemented for this dataset.

    !pip install aiohttp
    !pip install datasets
    from datasets import load_dataset
    
    common_voice_train = load_dataset("common_voice", "fr", split="train", streaming=True)
    

    This triggers the following error:

    /usr/local/lib/python3.7/dist-packages/datasets/utils/streaming_download_manager.py in _get_extraction_protocol(self, urlpath)
        137         elif path.endswith(".zip"):
        138             return "zip"
    --> 139         raise NotImplementedError(f"Extraction protocol for file at {urlpath} is not implemented yet")
        140 
        141     def download_and_extract(self, url_or_urls):
    
    NotImplementedError: Extraction protocol for file at https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com/cv-corpus-6.1-2020-12-11/tr.tar.gz is not implemented yet
    

    This means that streaming isn't supported for this dataset yet, perhaps because common_voice is distributed as compressed archives that have to be extracted, and streaming doesn't support that. The streaming functionality itself is definitely implemented, since it's in the docs.
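
    To make the failure mode in the traceback concrete, here is a minimal sketch of an extension-based check. It is a hypothetical, simplified stand-in and not the library's actual code: only the ".zip" branch is visible in the traceback, so that is the only branch reproduced. The function names `guess_extraction_protocol` and `can_stream`, and the example.com URLs, are assumptions made up for illustration.

```python
def guess_extraction_protocol(urlpath: str) -> str:
    """Hypothetical, simplified stand-in for the check seen in the traceback.

    Only the ".zip" branch is visible in the traceback; any other branches
    of the real method are not reproduced here.
    """
    if urlpath.endswith(".zip"):
        return "zip"
    # Anything else (for example ".tar.gz") falls through to the error.
    raise NotImplementedError(
        f"Extraction protocol for file at {urlpath} is not implemented yet"
    )


def can_stream(urlpath: str) -> bool:
    """Return True if the sketch above knows how to extract `urlpath`."""
    try:
        guess_extraction_protocol(urlpath)
        return True
    except NotImplementedError:
        return False


# Placeholder URLs, used only to exercise the two code paths.
print(can_stream("https://example.com/data/train.zip"))    # True
print(can_stream("https://example.com/corpus/tr.tar.gz"))  # False
```

    Under this sketch, a dataset whose files end in ".tar.gz" fails before any data is read, which matches the NotImplementedError above.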