Tags: python, python-3.x, ffmpeg, openai-api, openai-whisper

Transcription via OpenAI's Whisper: AssertionError: incorrect audio shape


I'm trying to use OpenAI's open-source Whisper library to transcribe audio files.

Here is my script's source code:

import whisper

model = whisper.load_model("large-v2")

# load the entire audio file
audio = whisper.load_audio("/content/file.mp3")
# Adding the line below pads/trims the audio to exactly 30 seconds;
# with it, the first 30 seconds are transcribed without any problem:
# audio = whisper.pad_or_trim(audio)

# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# decode the audio
options = whisper.DecodingOptions(fp16=False)
result = whisper.decode(model, mel, options)

# print the recognized text if available
try:
    if hasattr(result, "text"):
        print(result.text)
except Exception as e:
    print(f"Error while printing transcription: {e}")

# write the recognized text to a file
try:
    with open("output_of_file.txt", "w") as f:
        f.write(result.text)
        print("Transcription saved to file.")
except Exception as e:
    print(f"Error while saving transcription: {e}")

Here:

# load the entire audio file
audio = whisper.load_audio("/content/file.mp3")

when I add "audio = whisper.pad_or_trim(audio)" on the next line, the first 30 seconds of the sound file are transcribed without any problem, and language detection works as well.

But when I delete that line so that the whole file gets transcribed, I get the following error:

AssertionError: incorrect audio shape

What should I do? Should I change the structure of the sound file? If so, which library should I use, and what kind of script should I write?
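
For reference, printing the mel spectrogram's width shows the mismatch (N_FRAMES below is whisper.audio's 30-second frame count; I'm assuming that constant is present in the installed version):

import whisper
from whisper.audio import N_FRAMES  # mel frames in a 30-second window (3000)

audio = whisper.load_audio("/content/file.mp3")
mel = whisper.log_mel_spectrogram(audio)

# the model only accepts exactly N_FRAMES frames; a longer file yields
# a wider spectrogram and trips the "incorrect audio shape" assertion
print("mel frames:", mel.shape[-1], "| expected:", N_FRAMES)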


Solution

  • I had the same problem, and after some digging I found that whisper.decode operates on a single 30-second segment: it is mainly meant to extract metadata about the input, such as the language, hence the 30-second limit (see the source code for the decode function).

    To transcribe the audio (including files longer than 30 seconds), use whisper.transcribe instead, as shown in the following snippet; a per-window whisper.decode workaround is sketched at the end of this answer.

    import whisper
    
    model = whisper.load_model("large-v2")
    
    # load the entire audio file
    audio = whisper.load_audio("/content/file.mp3")
    
    options = {
        "language": "en", # input language, if omitted is auto detected
        "task": "translate" # or "transcribe" if you just want transcription
    }
    result = whisper.transcribe(model, audio, **options)
    print(result["text"])
    

    You can find documentation for the transcribe method in its source code, along with documentation for the DecodingOptions structure.
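
    If you do want to stay with the low-level whisper.decode API, here is a minimal sketch of the per-window workaround. It assumes whisper.audio exposes the N_FRAMES constant (the number of mel frames in a 30-second window) in your installed version; the idea is to slice the spectrogram into 30-second windows and pad the last one with pad_or_trim:

    import whisper
    from whisper.audio import N_FRAMES  # mel frames per 30-second window (3000)

    model = whisper.load_model("large-v2")
    audio = whisper.load_audio("/content/file.mp3")
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    options = whisper.DecodingOptions(fp16=False)
    texts = []
    for start in range(0, mel.shape[-1], N_FRAMES):
        # pad_or_trim also pads/trims spectrograms when given an explicit length
        window = whisper.pad_or_trim(mel[:, start:start + N_FRAMES], N_FRAMES)
        texts.append(whisper.decode(model, window, options).text)

    print(" ".join(texts))

    Note that this fixed-window slicing can cut words in half at window boundaries; whisper.transcribe instead advances a seek position based on the decoded timestamps, which is one more reason to prefer it.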