I'm trying out some of the transcription methods of the SpeechRecognition module. I was able to transcribe using the Google API (`recognize_google()`) just fine, but when I try using OpenAI's Whisper (`recognize_whisper()`), a temporary file `%LocalAppData%\Temp\tmps_pfkh0z.wav` (the actual filename changes slightly each time) is created and the script fails with a "permission denied" error:
```
Traceback (most recent call last):
  File "D:\Users\Renato\Documents\Code\projects\transcriber\.venv\lib\site-packages\whisper\audio.py", line 42, in load_audio
    ffmpeg.input(file, threads=0)
  File "D:\Users\Renato\Documents\Code\projects\transcriber\.venv\lib\site-packages\ffmpeg\_run.py", line 325, in run
    raise Error('ffmpeg', out, err)
ffmpeg._run.Error: ffmpeg error (see stderr output for detail)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "d:\Users\Renato\Documents\Code\projects\transcriber\main.py", line 15, in <module>
    print("Transcription: " + r.recognize_whisper(audio_data=audio_data, model="medium", language="uk"))
  File "D:\Users\Renato\Documents\Code\projects\transcriber\.venv\lib\site-packages\speech_recognition\__init__.py", line 1697, in recognize_whisper
    result = self.whisper_model[model].transcribe(
  File "D:\Users\Renato\Documents\Code\projects\transcriber\.venv\lib\site-packages\whisper\transcribe.py", line 85, in transcribe
    mel = log_mel_spectrogram(audio)
  File "D:\Users\Renato\Documents\Code\projects\transcriber\.venv\lib\site-packages\whisper\audio.py", line 111, in log_mel_spectrogram
    audio = load_audio(audio)
  File "D:\Users\Renato\Documents\Code\projects\transcriber\.venv\lib\site-packages\whisper\audio.py", line 47, in load_audio
  libavdevice    59.  7.100 / 59.  7.100
  libavfilter     8. 44.100 /  8. 44.100
  libswscale      6.  7.100 /  6.  7.100
  libswresample   4.  7.100 /  4.  7.100
  libpostproc    56.  6.100 / 56.  6.100
C:\Users\Renato\AppData\Local\Temp\tmps_pfkh0z.wav: Permission denied
```
The code itself is pretty straightforward:
```python
import speech_recognition as sr

r = sr.Recognizer()
with sr.AudioFile("audio.wav") as src:
    audio_data = r.record(src)

print("Transcription: " + r.recognize_whisper(audio_data=audio_data, model="medium", language="en"))
```
I tried different installations of ffmpeg (the gyan.dev and BtbN pre-built packages, and I also tried installing through Chocolatey). I also tried unchecking the "Read-only" option in the Temp folder's properties, but the error still happens. I'm running the script in a virtual environment created with venv, on a Windows machine.
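My current understanding of the failure (an assumption on my part, not something confirmed by the library docs): `tempfile.NamedTemporaryFile` keeps the file open while the `with` block runs, and on Windows a file held open that way generally can't be reopened by a second process, so when ffmpeg tries to open it by path it gets "Permission denied". A minimal sketch of the pattern that avoids this, using a Python child process to stand in for ffmpeg:

```python
import os
import subprocess
import sys
import tempfile

# With delete=False we can close our own handle first; only then does a
# separate process try to open the file by name, which works on Windows too.
tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
tmp.write(b"RIFF-stand-in")  # placeholder bytes; a real WAV isn't needed for the demo
tmp.close()  # release the handle BEFORE another process opens the file

# A Python child process stands in for ffmpeg re-opening the file by path.
child = subprocess.run(
    [sys.executable, "-c",
     "import sys; print(open(sys.argv[1], 'rb').read().decode())", tmp.name],
    capture_output=True, text=True,
)
print(child.stdout.strip())  # → RIFF-stand-in
os.remove(tmp.name)  # delete=False means cleanup is on us
```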
So, I got it to work, somehow. The `recognize_whisper` method of the `Recognizer` class in speech_recognition's `__init__.py` file has the line:

```python
with tempfile.NamedTemporaryFile(suffix=".wav") as f:
```

I guess because I run Windows here (yes, I hate it too...), it somehow runs into permission issues. I replaced it with:

```python
with open('test.wav', 'wb') as f:
```
Now the .wav file gets generated and the script runs without error. But it also doesn't show the recognition result...
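A less invasive edit than swapping in a fixed filename (again, my own sketch, not verified against the library's source) would be to keep a temp file but close it before anything else reads it, using `delete=False`:

```python
import os
import tempfile

# Windows-safe alternative to `with tempfile.NamedTemporaryFile(suffix=".wav") as f:`
f = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
try:
    f.write(b"fake wav bytes")  # the library would write the WAV data here
    f.close()  # close our handle so another opener (e.g. ffmpeg) can access f.name
    with open(f.name, "rb") as g:  # stands in for whisper/ffmpeg reading the file
        data = g.read()
    print(data.decode())  # → fake wav bytes
finally:
    os.remove(f.name)  # delete=False means we clean up manually
```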
Addition: after playing around with speech_recognition some more, I think the Whisper integration is just not working? I tried giving both Whisper and Google the same audio file:
```python
AUDIO_FILE = 'test.wav'

r = sr.Recognizer()
with sr.AudioFile(AUDIO_FILE) as source:
    audio = r.record(source)  # read the entire audio file

r.recognize_whisper(audio)
r.recognize_google(audio)
```
This gives results for the Google recognition but not the Whisper recognition (and I get the permission error again when I put the old code back in the `recognize_whisper()` method).
But if I try the same audio with whisper on its own (see https://github.com/openai/whisper), it works:
```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("test.wav")
print(result["text"])
```