Tags: python, machine-learning, deep-learning, speech-recognition, voice-recognition

Cannot Train Wav2vec XLSR Model With Common Voice Data


I am trying to train a transformer ASR model with wav2vec XLSR for the Danish language, but whenever I try to load the Danish dataset with the datasets library, it raises an error. Notebook link

error log:

ValueError: BuilderConfig da not found. Available: ['ab', 'ar', 'as', 'br', 'ca', 'cnh', 'cs', 'cv', 'cy', 'de', 'dv', 'el', 'en', 'eo', 'es', 'et', 'eu', 'fa', 'fi', 'fr', 'fy-NL', 'ga-IE', 'hi', 'hsb', 'hu', 'ia', 'id', 'it', 'ja', 'ka', 'kab', 'ky', 'lg', 'lt', 'lv', 'mn', 'mt', 'nl', 'or', 'pa-IN', 'pl', 'pt', 'rm-sursilv', 'rm-vallader', 'ro', 'ru', 'rw', 'sah', 'sl', 'sv-SE', 'ta', 'th', 'tr', 'tt', 'uk', 'vi', 'vot', 'zh-CN', 'zh-HK', 'zh-TW']
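
The error means that `da` is not one of the configuration names defined by the installed `common_voice` loader. A minimal sketch of that check, using the list copied from the error message; the helper names `is_language_available` and `fetch_config_names` are hypothetical and not part of the datasets library, and `fetch_config_names` assumes that `get_dataset_config_names` is available in your installed version and that you have network access:

```python
# Configuration names copied from the ValueError message above.
AVAILABLE_CONFIGS = [
    'ab', 'ar', 'as', 'br', 'ca', 'cnh', 'cs', 'cv', 'cy', 'de', 'dv', 'el',
    'en', 'eo', 'es', 'et', 'eu', 'fa', 'fi', 'fr', 'fy-NL', 'ga-IE', 'hi',
    'hsb', 'hu', 'ia', 'id', 'it', 'ja', 'ka', 'kab', 'ky', 'lg', 'lt', 'lv',
    'mn', 'mt', 'nl', 'or', 'pa-IN', 'pl', 'pt', 'rm-sursilv', 'rm-vallader',
    'ro', 'ru', 'rw', 'sah', 'sl', 'sv-SE', 'ta', 'th', 'tr', 'tt', 'uk',
    'vi', 'vot', 'zh-CN', 'zh-HK', 'zh-TW',
]


def is_language_available(lang_code, available=AVAILABLE_CONFIGS):
    """Return True if lang_code is one of the loader's configuration names."""
    return lang_code in available


def fetch_config_names(dataset_name="common_voice"):
    """Ask the installed datasets library for the configuration names.

    Hypothetical helper: needs `pip install datasets` and network access,
    so the import is done lazily and the function is not called here.
    """
    from datasets import get_dataset_config_names
    return get_dataset_config_names(dataset_name)


print(is_language_available("da"))     # False
print(is_language_available("sv-SE"))  # True
```

Running a check like this before `load_dataset` gives a clearer failure than the `ValueError`, and it also shows that codes must match exactly (for example `sv-SE`, not `sv`).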


Solution

  • I checked it for you.

    The Danish language subset is available in the following releases:

    • Common Voice Corpus 8.0
    • Common Voice Corpus 9.0

    However, Hugging Face's datasets library (version 2.2.1) uses version 6.1.0 of the corpus. You can check this yourself by loading any subset of the corpus and printing the dataset info as follows:

    Code

    from datasets import load_dataset
    
    dataset_de = load_dataset("common_voice", "de")
    print(dataset_de.info)
    

    Output

    Downloading and preparing dataset common_voice/de (download: 21.68 GiB, 
    generated: 137.78 MiB, post-processed: Unknown size, total: 21.82 GiB) to 
    /root/.cache/huggingface/datasets/common_voice/de/6.1.0/
    

    See the Corpus Details

    See the Library

    You should wait for a new release of the library that includes a newer corpus version, or open a feature request in its repository.
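
The version check above downloads the full German subset (about 21 GiB). A lighter sketch, assuming a datasets 2.x install where the script-based `common_voice` loader is still available and where `load_dataset_builder` exposes the version through `builder.info.version`: it reads only the loader metadata and then compares the version against the releases listed above. The names `loader_corpus_version`, `has_danish`, and `DANISH_RELEASES` are hypothetical and introduced only for this illustration.

```python
# Releases that include Danish, as listed in the answer above (major.minor).
DANISH_RELEASES = {"8.0", "9.0"}


def loader_corpus_version(config_name="de"):
    """Return the corpus version targeted by the installed common_voice loader.

    Hypothetical helper: needs `pip install datasets` and network access to
    fetch the loader script, but does not download any audio. The import is
    lazy so the rest of this example runs without the library installed.
    """
    from datasets import load_dataset_builder
    builder = load_dataset_builder("common_voice", config_name)
    return str(builder.info.version)


def has_danish(corpus_version):
    """Return True if the major.minor part of the version is a listed release."""
    major_minor = ".".join(corpus_version.split(".")[:2])
    return major_minor in DANISH_RELEASES


# Version taken from the cache path shown in the output above.
print(has_danish("6.1.0"))  # False
```

With the library installed, `has_danish(loader_corpus_version())` combines both steps; once a library release ships a loader for Corpus 8.0 or later, that check should start returning `True`.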