The Common Voice v11 dataset on HuggingFace has some amazing View features! They include a dropdown button to select the language, and columns with the dataset features, such as client_id, audio, sentence, etc.
I am building an audio dataset for 7 languages. I have split the audio into multiple parts, giving me two sets of files: the audio (1.mp3, 2.mp3, etc.) and the corresponding transcriptions (1.txt, 2.txt, etc.).
These files are distributed into three folders for each language: train, test, and validate.
I am able to upload the data to HuggingFace, but how do I format the View option so that it shows the splits (train, test, validate) and the columns (client_id, audio, sentence)?
Thank you!
It took me a while to find a solution. But here is what worked for me to recreate the view features of Mozilla's Common Voice on HuggingFace:
Ensure that you use the push_to_hub method with the config_name parameter (which the Viewer displays as the subset) and the split parameter.
Here's a pseudo code outline:
# Install necessary libraries (run this in your shell, not in Python):
# pip install datasets huggingface_hub pandas

# Import required modules
import os
from datasets import Dataset, Audio
from huggingface_hub import HfApi, login

# Authenticate with Hugging Face
hf_token = 'your_hf_token'
login(token=hf_token)

Use HfApi to check and create the repository.

# Initialize API
api = HfApi()
repo_id = 'your_repo_id'

# Function to create repo if not exists
def create_repo_if_not_exists(repo_id, hf_token):
    # exist_ok=True turns this into a no-op when the repo already exists
    api.create_repo(repo_id=repo_id, token=hf_token, repo_type='dataset', exist_ok=True)

Create a Dataset object for each split.

# Function to load and process each split
def load_split(data_dir, split, language):
    data = []
    folder_path = os.path.join(data_dir, language, split)
    if os.path.exists(folder_path):
        # Sort so the rows come out in a stable order
        for file_name in sorted(os.listdir(folder_path)):
            if file_name.endswith(".mp3"):
                stem = os.path.splitext(file_name)[0]
                audio_path = os.path.join(folder_path, file_name)
                text_path = os.path.join(folder_path, stem + ".txt")
                # Skip audio files that have no transcription next to them
                if os.path.exists(text_path):
                    with open(text_path, "r", encoding="utf-8") as f:
                        sentence = f.read().strip()
                    # Assumes numeric file names such as 1.mp3, 2.mp3
                    data.append({"client_id": int(stem), "audio": audio_path, "sentence": sentence})
    if data:
        # Build the Dataset directly from the list of row dicts
        return Dataset.from_list(data)
    return None
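
Since load_split silently skips any .mp3 without a matching .txt, it can help to check the pairing before uploading. Here is a minimal sketch using only the standard library. It assumes the layout described in the question (data_dir/language/split/N.mp3 with N.txt next to it), and the helper name find_unpaired is my own, not part of any library:

```python
from pathlib import Path


def find_unpaired(data_dir, language, split):
    """Return (audio_without_text, text_without_audio) as sorted lists of file stems.

    Assumes the layout data_dir/<language>/<split>/<name>.mp3 and <name>.txt.
    """
    folder = Path(data_dir) / language / split
    if not folder.is_dir():
        # A missing folder has nothing to report
        return [], []
    audio = {p.stem for p in folder.glob("*.mp3")}
    text = {p.stem for p in folder.glob("*.txt")}
    return sorted(audio - text), sorted(text - audio)
```

Calling find_unpaired('path_to_your_data_directory', 'en', 'train') should return two empty lists when every audio file has its transcription.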
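
Use the push_to_hub method with appropriate parameters.

data_dir = 'path_to_your_data_directory'
languages = ["en", "pt", "de"]
# Must match the folder names on disk (the question uses train, test, validate)
splits = ["train", "test", "validate"]

# Create the repo once, before pushing anything
create_repo_if_not_exists(repo_id, hf_token)

for language in languages:
    for split in splits:
        dataset = load_split(data_dir, split, language)
        if dataset is not None:
            # Treat the file paths as audio so the Viewer can render a player
            dataset = dataset.cast_column("audio", Audio())
            # config_name is what the Viewer shows in the subset dropdown
            dataset.push_to_hub(repo_id, config_name=language, split=split)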
This approach ensures that the Viewer pane on Hugging Face will display both subset and split dropdowns correctly, provided you have more than one subset.
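
To preview which subset and split combinations the upload loop would actually push, something like the following can be run first. It is only a sketch under the same folder-layout assumption as above, it uses the standard library alone, and planned_uploads is a hypothetical helper name:

```python
from pathlib import Path


def planned_uploads(data_dir, languages, splits):
    """List (language, split, n_audio) for folders that contain at least one .mp3."""
    plan = []
    for language in languages:
        for split in splits:
            folder = Path(data_dir) / language / split
            # Count audio files; a missing folder counts as zero
            n_audio = len(list(folder.glob("*.mp3"))) if folder.is_dir() else 0
            if n_audio:
                plan.append((language, split, n_audio))
    return plan
```

If the returned list contains only one language, expect the subset dropdown not to appear, as noted above.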