Quite new to using models; I'm trying to use the ivrit-ai/whisper-large-v2-tuned model with a 'long-form' audio file, as they advise here: Long-Form Transcription instructions.
I'm getting the following error:
raise ValueError(
ValueError: Multiple languages detected when trying to predict the most likely target
language for transcription.
It is currently not supported to transcribe to different languages in a single batch.
Please make sure to either force a single language by passing `language='...'` or make sure all input audio is of the same language.
My code:
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="ivrit-ai/whisper-large-v2-tuned",
    chunk_length_s=30,
    device=device,
)

audio_file = './audio/sales_call.mp3'
with open(audio_file, 'rb') as file:
    audio = file.read()

prediction = pipe(audio, batch_size=8, return_timestamps=True)["chunks"]

with open('transcription.txt', 'w', encoding='utf-8') as file:
    for item in prediction:
        file.write(f"{item['text']},{item['timestamp']}\n")
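For context, each element of prediction is a dict of the form {'text': ..., 'timestamp': (start, end)}. The write loop can be sketched as a small standalone helper; the sample chunk below is made up purely to show the shape, and format_chunks is my own name, not a library function:

```python
def format_chunks(chunks):
    # Each chunk from the ASR pipeline is a dict with a 'text' string
    # and a 'timestamp' (start, end) tuple; emit one line per chunk,
    # matching the f-string used in the script above.
    return "".join(f"{c['text']},{c['timestamp']}\n" for c in chunks)

# Made-up sample data, just to illustrate the expected structure:
sample = [{"text": "hello world", "timestamp": (0.0, 1.5)}]
print(format_chunks(sample), end="")  # → hello world,(0.0, 1.5)
```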
Can you try this piece of code:
pipe(audio, generate_kwargs={"task": "transcribe", "language": "<|en|>"})
For your use case it should be:
prediction = pipe(audio, batch_size=8, return_timestamps=True, generate_kwargs={"task": "transcribe", "language": "<|en|>"})["chunks"]
Reference here
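One hedged note: ivrit-ai/whisper-large-v2-tuned is a Hebrew fine-tune, so if the sales call is in Hebrew you would pass the Hebrew code rather than English. Whisper's language can be written either as a bare code ("en") or in the special-token form ("<|en|>"); a tiny illustrative helper to normalize between the two (to_language_token is my own name, not part of transformers):

```python
def to_language_token(lang: str) -> str:
    # Wrap a bare language code in Whisper's special-token form;
    # leave it unchanged if it is already wrapped.
    if lang.startswith("<|") and lang.endswith("|>"):
        return lang
    return f"<|{lang}|>"

# Build the generate_kwargs dict used in the fixed call above:
generate_kwargs = {"task": "transcribe", "language": to_language_token("en")}
print(generate_kwargs)  # → {'task': 'transcribe', 'language': '<|en|>'}
```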