Tags: python, pipeline, huggingface-transformers

Speech to text pipelining


Quite new to using models, I'm trying to use the ivrit-ai/whisper-large-v2-tuned model

with a 'long-form' audio file, as they advise here: Long-Form Transcription instructions

I'm getting the following error:

raise ValueError(
ValueError: Multiple languages detected when trying to predict the most likely target
language for transcription. 
It is currently not supported to transcribe to different languages in a single batch.
Please make sure to either force a single language by passing `language='...'` or make sure all input audio is of the same language.

My code

import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
  "automatic-speech-recognition",
  model="ivrit-ai/whisper-large-v2-tuned",
  chunk_length_s=30,
  device=device,
)

audio_file = './audio/sales_call.mp3'

with open(audio_file, 'rb') as file:
    audio = file.read()

prediction = pipe(audio, batch_size=8, return_timestamps=True)["chunks"]

with open('transcription.txt', 'w', encoding='utf-8') as file:
    for item in prediction:
        file.write(f"{item['text']},{item['timestamp']}\n")

Solution

  • Can you try this piece of code:

    pipe(audio, generate_kwargs={"task": "transcribe", "language": "<|en|>"})
    

    For your use case it should be:

    prediction = pipe(audio, batch_size=8, return_timestamps=True, generate_kwargs={"task": "transcribe", "language": "<|en|>"})["chunks"]

    Forcing a single language skips Whisper's per-batch language detection, which is what raises the error when different chunks are detected as different languages. Note that ivrit-ai/whisper-large-v2-tuned is tuned for Hebrew, so if your audio is Hebrew, "<|he|>" is likely the language token you want.
    
    

    Reference here
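
    As a side note, the question's file-writing loop can be factored into a small helper that is easy to test without running the model. This is just a sketch: the chunk structure (a dict with a `text` string and a `(start, end)` timestamp tuple) is what the ASR pipeline returns with `return_timestamps=True`, and the sample values below are made up.

    ```python
    def format_chunks(chunks):
        """Turn ASR pipeline chunks into one 'text,(start, end)' line each,
        mirroring the write loop in the question."""
        return [f"{c['text']},{c['timestamp']}" for c in chunks]

    # Made-up sample with the same shape as pipeline output:
    sample = [
        {"text": "hello there", "timestamp": (0.0, 2.5)},
        {"text": "how are you", "timestamp": (2.5, 5.0)},
    ]

    lines = format_chunks(sample)

    with open('transcription.txt', 'w', encoding='utf-8') as f:
        f.write("\n".join(lines) + "\n")
    ```

    With real output, you would pass `prediction` (the `"chunks"` list) instead of `sample`.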