
TranscriptionOptions.__new__() missing 3 required positional arguments: 'max_new_tokens', 'clip_timestamps', and 'hallucination_silence_threshold'


I need some help with this error. I'm using WhisperX to extract text from an audio file. When I run the code via VS Code (on a Mac mini M2) without the transcription options, everything works as expected, but when I run docker compose up, the code fails and says I'm missing 3 arguments. I've tried adding those arguments to my code, but that doesn't work either.

2024-03-20 22:23:15 test-1  | Traceback (most recent call last):
2024-03-20 22:23:15 test-1  |     import speech
2024-03-20 22:23:15 test-1  |   File "/code/speech.py", line 274, in <module>
2024-03-20 22:23:15 test-1  |   File "/code/speech.py", line 69, in get_speeach_to_text
2024-03-20 22:23:15 test-1  |     model = whisperx.load_model("medium", device, compute_type=compute_type)
2024-03-20 22:23:15 test-1  |   File "/usr/local/lib/python3.10/site-packages/whisperx/asr.py", line 332, in load_model
2024-03-20 22:23:15 test-1  |     default_asr_options = faster_whisper.transcribe.TranscriptionOptions(**default_asr_options)
2024-03-20 22:23:15 test-1  | TypeError: TranscriptionOptions.__new__() missing 3 required positional arguments: 'max_new_tokens', 'clip_timestamps', and 'hallucination_silence_threshold'

I'm running the container with --platform=linux/arm64 on the python:3.10 image.

This is my speech.py code after adding the transcription options (though I'd really prefer not to include them at all), and I'm running the latest whisperx:

import whisperx
import faster_whisper

# Define TranscriptionOptions with example values for the required arguments
transcription_options = faster_whisper.transcribe.TranscriptionOptions(
    beam_size=4,
    best_of=1,
    patience=10,
    length_penalty=0.6,
    repetition_penalty=1.2,
    no_repeat_ngram_size=2,
    log_prob_threshold=-20,
    no_speech_threshold=0.5,
    compression_ratio_threshold=0.5,
    condition_on_previous_text=False,
    prompt_reset_on_temperature=True,
    temperatures=[0.7],
    initial_prompt="",
    prefix="",
    suppress_blank=False,
    suppress_tokens=False,
    without_timestamps=True,
    max_initial_timestamp=60,
    word_timestamps=False,
    prepend_punctuations="",
    append_punctuations="",
    max_new_tokens=50,
    clip_timestamps=60,
    hallucination_silence_threshold=0.5
)

device = "cpu"
batch_size = 16  # reduce if low on GPU mem
compute_type = "int8"

# 1. Transcribe with original whisper (batched)
model = whisperx.load_model("medium", device, compute_type=compute_type)

source_audio_file = "audio.wav"  # placeholder path for this example
audio = whisperx.load_audio(source_audio_file)
result = model.transcribe(audio, language='en', task='translate', batch_size=batch_size, transcription_options=transcription_options)

Any idea what could be wrong here?


Solution

  • Caveat emptor: I don't know much about this particular stack, but I took a swing at solving this problem. Happy to iterate towards a working solution.

    I think that this is a version compatibility issue. Newer faster-whisper releases (including the 1.0.0 pinned below) added max_new_tokens, clip_timestamps, and hallucination_silence_threshold as required fields on TranscriptionOptions, and the latest version of whisperx on PyPI doesn't appear to pass those parameters. You could install the current version of whisperx directly from GitHub (that version does pass them!), e.g. with a requirements.txt like this:

    whisper==1.1.10
    faster-whisper==1.0.0
    git+https://github.com/m-bain/whisperX.git
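
    To verify inside the container that a compatible pair actually got installed, a quick sanity check like this can help (this assumes TranscriptionOptions is still a NamedTuple, as it is in faster-whisper 1.0.0):

    import faster_whisper

    # Print the installed version, then confirm the three fields the
    # TypeError complains about exist on TranscriptionOptions.
    print(faster_whisper.__version__)
    required = {"max_new_tokens", "clip_timestamps", "hallucination_silence_threshold"}
    print(required <= set(faster_whisper.transcribe.TranscriptionOptions._fields))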
    

    This will get you past the TypeError, but you will likely see warnings about model versions. That appears to be a separate issue:

    Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
    Model was trained with torch 1.10.0+cu102, yours is 2.2.1+cu121. Bad things might happen unless you revert torch to 1.x.
    

    Not sure what your Dockerfile looks like, but this is what I used for testing:

    FROM python:3.10
    
    WORKDIR /code
    
    COPY requirements.txt .
    
    RUN pip3 install -r requirements.txt
    
    COPY speech.py .
    
    CMD python3 speech.py
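
    Since you're bringing this up with docker compose, the matching docker-compose.yml would be something along these lines (the service name and platform are assumptions based on your log and setup):

    # Minimal compose sketch matching the Dockerfile above.
    services:
      test:
        build: .
        platform: linux/arm64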
    

    At present it seems that the model is being downloaded at run time, and this takes a while. It might make more sense to download the model onto the host and then share it with the container via a volume mount, sketched below. Again, see caveat above. :)
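
    whisperx fetches models through Hugging Face / faster-whisper, which cache under ~/.cache/huggingface by default (the exact cache path is an assumption here), so mounting that directory from the host persists downloads across container runs:

    # Sketch: reuse the host's Hugging Face cache inside the container
    # so the model isn't re-downloaded on every run.
    services:
      test:
        build: .
        platform: linux/arm64
        volumes:
          - ~/.cache/huggingface:/root/.cache/huggingface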