Tags: large-language-model, huggingface, huggingface-datasets, openai-whisper, huggingface-hub

How to recreate the "view" features of common voice v11 in HuggingFace?


The Common Voice v11 on HuggingFace has some amazing View features! They include a dropdown button to select the language, and columns with the dataset features, such as client_id, audio, sentence, etc.

I am building an audio dataset for 7 languages. I have split the audio into multiple parts, giving me two sets of files: the audio clips 1.mp3, 2.mp3, etc., and the corresponding transcriptions 1.txt, 2.txt, etc.

These files are distributed into three folders: train, test, and validate for each language.
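Concretely, the layout described above looks something like this (language codes and file counts are illustrative):

```text
data/
├── en/
│   ├── train/
│   │   ├── 1.mp3
│   │   ├── 1.txt
│   │   ├── 2.mp3
│   │   └── 2.txt
│   ├── test/
│   └── validate/
├── pt/
│   └── ...
└── ...
```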

I am able to upload the data to HuggingFace, but how do I configure the Viewer so that:

  1. I can select the language from a dropdown list
  2. Then select the type of data: train, test, validate
  3. View three columns: client_id, audio, sentence

Thank you!


Solution

  • It took me a while to find a solution. But here is what worked for me to recreate the view features of Mozilla's Common Voice on HuggingFace:

    The key is to call the push_to_hub method with the config_name (the "subset") and split parameters.

    Here's an outline:

    1. Set up your environment:
      • Install the required libraries.
      • Authenticate with Hugging Face using your token.
    # Install necessary libraries
    pip install datasets huggingface_hub pandas
    
    # Import required modules
    import os
    import pandas as pd
    from datasets import Dataset, Audio
    from huggingface_hub import HfApi, HfFolder
    
    # Authenticate with Hugging Face
    hf_token = 'your_hf_token'
    HfFolder.save_token(hf_token)
    
    2. Create a repository if it doesn't exist:
      • Use the HfApi to check and create the repository.
    # Initialize API
    api = HfApi()
    repo_id = 'your_repo_id'
    
    # Function to create repo if not exists
    # (recent huggingface_hub versions also accept exist_ok=True on create_repo,
    # which makes the try/except below unnecessary)
    def create_repo_if_not_exists(repo_id, hf_token):
        try:
            api.create_repo(repo_id=repo_id, token=hf_token, repo_type='dataset')
        except Exception as e:
            if "409 Client Error: Conflict" in str(e):
                pass  # Repo already exists
            else:
                raise  # re-raise anything unexpected
    
    3. Load and process your data splits:
      • Iterate through your data directory to load audio and text files.
      • Create a Dataset object for each split.
    # Function to load and process each split
    def load_split(data_dir, split, language):
        data = []
        folder_path = os.path.join(data_dir, language, split)
        if os.path.exists(folder_path):
            for file_name in sorted(os.listdir(folder_path)):  # sort for stable row order
                if file_name.endswith(".mp3"):
                    audio_path = os.path.join(folder_path, file_name)
                    text_path = os.path.join(folder_path, file_name.replace(".mp3", ".txt"))
                    if os.path.exists(text_path):
                        with open(text_path, "r", encoding="utf-8") as f:
                            sentence = f.read().strip()
                        data.append({"client_id": int(file_name.replace(".mp3", "")), "audio": audio_path, "sentence": sentence})
        if data:
            dataset = Dataset.from_pandas(pd.DataFrame(data))
            return dataset
        return None
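    The pairing logic in load_split can be exercised without the datasets library installed; this stdlib-only sketch (file names illustrative) builds a throwaway split on disk and collects the same row dicts, stopping short of the Dataset.from_pandas step:

```python
import os
import tempfile

# Build a tiny throwaway split: one clip paired with its transcript.
folder = os.path.join(tempfile.mkdtemp(), "en", "train")
os.makedirs(folder)
open(os.path.join(folder, "1.mp3"), "wb").close()
with open(os.path.join(folder, "1.txt"), "w", encoding="utf-8") as f:
    f.write("hello world\n")

# Same pairing logic as load_split, minus the Dataset.from_pandas step.
rows = []
for file_name in sorted(os.listdir(folder)):
    if file_name.endswith(".mp3"):
        text_path = os.path.join(folder, file_name.replace(".mp3", ".txt"))
        if os.path.exists(text_path):
            with open(text_path, "r", encoding="utf-8") as f:
                sentence = f.read().strip()
            rows.append({"client_id": int(file_name.replace(".mp3", "")),
                         "audio": os.path.join(folder, file_name),
                         "sentence": sentence})

print(rows)  # one row: client_id 1, the mp3 path, and "hello world"
```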
    
    4. Push the datasets to Hugging Face Hub:
      • Iterate through languages and splits.
      • Push datasets using the push_to_hub method with appropriate parameters.
    data_dir = 'path_to_your_data_directory'
    languages = ["en", "pt", "de"]
    splits = ["train", "test", "validate"]
    
    # Create the repo once, then push every (language, split) combination
    create_repo_if_not_exists(repo_id, hf_token)
    for language in languages:
        for split in splits:
            dataset = load_split(data_dir, split, language)
            if dataset:
                dataset = dataset.cast_column("audio", Audio())
                # config_name is what the Hub shows in the "subset" dropdown
                dataset.push_to_hub(repo_id, config_name=language, split=split)
    

    This approach ensures that the Viewer pane on Hugging Face will display both subset and split dropdowns correctly, provided you have more than one subset.
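    For context, pushing with config_name and split works because each configuration is recorded in the dataset card's YAML header, which is what the Viewer reads to build the dropdowns. The shape is roughly as below; the actual path globs are whatever your upload generates, shown here only for orientation:

```yaml
configs:
- config_name: en
  data_files:
  - split: train
    path: en/train-*
  - split: test
    path: en/test-*
- config_name: pt
  data_files:
  - split: train
    path: pt/train-*
```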