Tags: python, speech-recognition, jupyter-lab, huggingface-transformers

Huggingface model: Kernel Restarting: The kernel for .ipynb appears to have died. It will restart automatically


I am using a pre-trained Huggingface model for speech recognition in Spanish to transcribe 922 .mp3 files. However, after transcribing fewer than 10 files, it breaks with the following message:

Kernel Restarting: The kernel for .ipynb appears to have died. It will restart automatically

I have tried alternatives mentioned in other questions, such as reinstalling conda or mkl, and running the code on cloud servers such as Saturn Cloud or Colab. But the same thing happens on Saturn Cloud, and on plain Colab the runtime is capped.

The code is as follows:

import os
from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-spanish")

# Directory containing the audio files
root_dir = os.path.join("..", "data", "raw", "audio_files")

# Keep only the .mp3 files
file_paths = [os.path.join(root_dir, file) for file in os.listdir(root_dir)]
file_paths = [file for file in file_paths
              if os.path.isfile(file) and file.endswith(".mp3")]

# Transcribe the audios one file at a time
transcriptions = []
for file_path in file_paths:
    transcript = model.transcribe([file_path])
    transcriptions.append(transcript)

Python: 3.8.5

Files info: .mp3 files of political speeches sampled at 44 kHz, even though the package recommends sampling them at 16 kHz
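In case it is relevant, this is roughly how I would downsample the files to 16 kHz before transcribing (pydub and the output folder are my own choices here, not something the package requires):

import os
from pydub import AudioSegment  # pydub needs ffmpeg installed to read .mp3

root_dir = os.path.join("..", "data", "raw", "audio_files")
out_dir = os.path.join("..", "data", "processed", "audio_16khz")  # arbitrary output folder
os.makedirs(out_dir, exist_ok=True)

for file in os.listdir(root_dir):
    if file.endswith(".mp3"):
        audio = AudioSegment.from_mp3(os.path.join(root_dir, file))
        # Resample to the 16 kHz rate recommended by the package
        audio.set_frame_rate(16000).export(os.path.join(out_dir, file), format="mp3")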

Packages info:

NumPy: 1.23.1

SciPy: 1.8.1

Model: jonatasgrosman/wav2vec2-large-xlsr-53-spanish (22nd July 2022 version)

Jupyter notebook packages info:

IPython : 7.29.0

ipykernel : 6.4.1

ipywidgets : 7.6.5

jupyter_client : 7.0.6

jupyter_core : 4.9.1

jupyter_server : 1.4.1

jupyterlab : 3.2.1

nbclient : 0.5.3

nbconvert : 6.1.0

nbformat : 5.1.3

notebook : 6.4.6

qtconsole : 5.2.2

traitlets : 5.1.1

Any help on how to solve this would be much appreciated.


Solution

  • The HuggingSound creator here! This is probably happening due to resource exhaustion: these wav2vec2-based models use a lot of memory to perform the transcriptions. Try to increase your machine's RAM (or the VRAM if you're using a GPU). If you can't improve your resources, try to split the audio into small chunks before passing them to the transcribe method, as in the sketch below.
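A minimal sketch of that chunking idea, assuming pydub for slicing the audio (any library that can cut and re-export .mp3 would do; pydub needs ffmpeg installed) and assuming each result dict returned by transcribe carries a "transcription" field:

import os
import tempfile
from pydub import AudioSegment
from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-spanish")
CHUNK_MS = 30_000  # 30-second chunks; shrink this if memory is still tight

def transcribe_in_chunks(path):
    # Load the .mp3 and downsample to the recommended 16 kHz
    audio = AudioSegment.from_mp3(path).set_frame_rate(16000)
    texts = []
    with tempfile.TemporaryDirectory() as tmp_dir:
        for start in range(0, len(audio), CHUNK_MS):
            chunk_path = os.path.join(tmp_dir, f"chunk_{start}.mp3")
            audio[start:start + CHUNK_MS].export(chunk_path, format="mp3")
            # Transcribe one small chunk at a time to keep memory usage low
            result = model.transcribe([chunk_path])
            texts.append(result[0]["transcription"])
    return " ".join(texts)

Each file from the question's file_paths list could then be passed through transcribe_in_chunks and the joined text appended to transcriptions, instead of transcribing whole speeches in one call.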