python machine-learning deep-learning speech-recognition voice-recognition

Cannot Train Wav2vec XLSR Model With Common Voice Data

I am trying to train a transformer ASR model with wav2vec XLSR for the Danish language, but whenever I try to pull the Danish dataset with the datasets library, I get an error.

error log:

ValueError: BuilderConfig da not found. Available: ['ab', 'ar', 'as', 'br', 'ca', 'cnh', 'cs', 'cv', 'cy', 'de', 'dv', 'el', 'en', 'eo', 'es', 'et', 'eu', 'fa', 'fi', 'fr', 'fy-NL', 'ga-IE', 'hi', 'hsb', 'hu', 'ia', 'id', 'it', 'ja', 'ka', 'kab', 'ky', 'lg', 'lt', 'lv', 'mn', 'mt', 'nl', 'or', 'pa-IN', 'pl', 'pt', 'rm-sursilv', 'rm-vallader', 'ro', 'ru', 'rw', 'sah', 'sl', 'sv-SE', 'ta', 'th', 'tr', 'tt', 'uk', 'vi', 'vot', 'zh-CN', 'zh-HK', 'zh-TW']

Solution

I checked it for you.

The Danish language subset is supported in the Common Voice Corpus 8.0 and Common Voice Corpus 9.0 releases.

However, Hugging Face's datasets library (version 2.2.1) uses version 6.1.0 of the corpus, which does not include Danish (note that 'da' is missing from the list in your error). You can check this yourself by loading any subset of the corpus and printing the dataset info as follows:

Code

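from datasets import load_dataset

# Request a single split so that a Dataset (which exposes .info) is returned
# rather than a DatasetDict holding all splits.
dataset_de = load_dataset("common_voice", "de", split="train")
print(dataset_de.info)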

Output

Downloading and preparing dataset common_voice/de (download: 21.68 GiB, 
generated: 137.78 MiB, post-processed: Unknown size, total: 21.82 GiB) to 
/root/.cache/huggingface/datasets/common_voice/de/6.1.0/

See the Corpus Details

See the Library

You should either wait for a new release of the library that ships a newer corpus version, or open a request in its repository.
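
After upgrading the library, you could check whether Danish has become available before starting a long download. This is only a sketch: it assumes `get_dataset_config_names` from the datasets library can list the configs of the `common_voice` loader in your version (network access required), and `missing_configs` and `common_voice_config_names` are hypothetical helper names.

```python
def missing_configs(requested, available):
    """Return the requested config names that are not in `available`, in order."""
    available_set = set(available)
    return [name for name in requested if name not in available_set]


def common_voice_config_names():
    """List the language configs the installed loader knows about.

    Assumption: the `datasets` package is installed and the "common_voice"
    loader can be fetched over the network.
    """
    from datasets import get_dataset_config_names  # imported lazily on purpose

    return get_dataset_config_names("common_voice")
```

If `missing_configs(["da"], common_voice_config_names())` returns an empty list, the loader lists Danish and `load_dataset("common_voice", "da")` should no longer raise the BuilderConfig error.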