I am trying to save disk space so that I can use the CommonVoice French dataset (19 GB) on Google Colab, as my notebook always crashes when it runs out of disk space. I saw in the HuggingFace documentation that a dataset can be loaded in streaming mode, so we can iterate over it directly without having to download the entire dataset.
I tried to use that mode in Google Colab, but can't make it work, and I haven't found anything on SO about this issue.
!pip install datasets
!pip install 'datasets[streaming]'
!pip install aiohttp
from datasets import load_dataset
common_voice_train = load_dataset("common_voice", "fr", split="train", streaming=True)
Then, I get the following error:
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-24-489f8a0ca4e4> in <module>()
----> 1 common_voice_train = load_dataset("common_voice", "fr", split="train", streaming=True)
/usr/local/lib/python3.7/dist-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, script_version, use_auth_token, task, streaming, **config_kwargs)
811 if not config.AIOHTTP_AVAILABLE:
812 raise ImportError(
--> 813 f"To be able to use dataset streaming, you need to install dependencies like aiohttp "
814 f'using "pip install \'datasets[streaming]\'" or "pip install aiohttp" for instance'
815 )
ImportError: To be able to use dataset streaming, you need to install dependencies like aiohttp using "pip install 'datasets[streaming]'" or "pip install aiohttp" for instance
---------------------------------------------------------------------------
NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.
To view examples of installing some common dependencies, click the
"Open Examples" button below.
---------------------------------------------------------------------------
Is there a reason why Google Colab wouldn't allow streaming to load a dataset?
Otherwise, what am I missing?
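One thing that may be worth checking is whether the running kernel can actually see aiohttp. The sketch below uses only the standard library; `is_importable` is a hypothetical helper name, not part of `datasets`. The `config.AIOHTTP_AVAILABLE` line in the traceback suggests (this is an assumption, not something confirmed here) that the library records availability when it is first imported, so if `datasets` was imported before aiohttp was installed, restarting the Colab runtime might be needed.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if the current interpreter can locate a top-level module.

    Hypothetical helper for diagnosis only; it does not import the module.
    """
    return importlib.util.find_spec(module_name) is not None


# If this prints False right after `!pip install aiohttp`, the install did not
# land in the environment the kernel is using.
print("aiohttp importable:", is_importable("aiohttp"))
```

If this prints True but `load_dataset(..., streaming=True)` still raises the ImportError, a stale flag inside an already-imported `datasets` would be one plausible explanation.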
Posting an answer for future reference. Based on @kkgarg's comment, it seems that streaming is not supported for this dataset yet. With aiohttp installed before datasets is imported, the ImportError no longer appears, but the following code:
!pip install aiohttp
!pip install datasets
from datasets import load_dataset
common_voice_train = load_dataset("common_voice", "fr", split="train", streaming=True)
Triggers the following error:
/usr/local/lib/python3.7/dist-packages/datasets/utils/streaming_download_manager.py in _get_extraction_protocol(self, urlpath)
137 elif path.endswith(".zip"):
138 return "zip"
--> 139 raise NotImplementedError(f"Extraction protocol for file at {urlpath} is not implemented yet")
140
141 def download_and_extract(self, url_or_urls):
NotImplementedError: Extraction protocol for file at https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com/cv-corpus-6.1-2020-12-11/tr.tar.gz is not implemented yet
So streaming itself is implemented (it is in the docs), but it does not work for this dataset yet. The likely reason is that common_voice ships its data as .tar.gz archives that have to be extracted, and, as the error says, the streaming download manager has no extraction protocol for that file type.
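To illustrate why a .tar.gz URL ends up in that error, here is a simplified stand-in for the check shown in the traceback. It is not the library's actual code: `guess_extraction_protocol` is a hypothetical name, only the `.zip` branch visible in the traceback is reproduced, and the example URLs are made up.

```python
def guess_extraction_protocol(urlpath: str) -> str:
    """Simplified, hypothetical stand-in for the check in the traceback.

    Only the ".zip" branch that is visible in the traceback is reproduced;
    any other branches the real function has are omitted here.
    """
    # Ignore a query string, if any, before looking at the extension.
    path = urlpath.split("?")[0]
    if path.endswith(".zip"):
        return "zip"
    raise NotImplementedError(
        f"Extraction protocol for file at {urlpath} is not implemented yet"
    )


# Made-up URLs, for illustration only.
for url in ["https://example.com/data.zip", "https://example.com/corpus/fr.tar.gz"]:
    try:
        print(url, "->", guess_extraction_protocol(url))
    except NotImplementedError as err:
        print("unsupported:", err)
```

Under this reading, a dataset whose files are in a format the streaming code recognizes should stream, while one that points at a .tar.gz archive fails before any data is read.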