Tags: python, openai-api, openai-whisper

Can Distilled Whisper Models be used as a Drop-In Replacement for OpenAI Whisper?


I have a working video transcription pipeline that uses a local OpenAI Whisper model. I would like to use the equivalent distilled model ("distil-small.en"), which is smaller and faster.

def transcribe(self):
    file = "/path/to/video"

    model = whisper.load_model("small.en")          # WORKS
    model = whisper.load_model("distil-small.en")   # DOES NOT WORK 

    transcript = model.transcribe(word_timestamps=True, audio=file)
    print(transcript["text"])

However, I get an error that the model was not found:

RuntimeError: Model distil-small.en not found; available models = ['tiny.en', 'tiny', 'base.en', 'base', 'small.en', 'small', 'medium.en', 'medium', 'large-v1', 'large-v2', 'large-v3', 'large']

I installed my dependencies with Poetry (which uses pip under the hood) as follows:

[tool.poetry.dependencies]
python = "^3.11"
openai-whisper = "*"
transformers = "*" # distilled whisper models
accelerate = "*" # distilled whisper models
datasets = { version = "*", extras = ["audio"] } # distilled whisper models

The Distil-Whisper documentation on GitHub appears to use a different approach (via Hugging Face Transformers) to installing and using these models.
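
For reference, their README shows something like the following Transformers-based pipeline (a sketch; the audio path and generation options here are illustrative, not from my pipeline):

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Pick GPU + half precision when available, otherwise CPU + full precision.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-small.en"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id)

# Wrap the model in an ASR pipeline instead of calling whisper.load_model.
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe("/path/to/audio")  # placeholder path
print(result["text"])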

Is it possible to use a Distilled model as a drop-in replacement for a regular Whisper model?


Solution

  • load_model called with a model name only accepts OpenAI's known list of models. If you want to use your own model, you need to download it from the Hugging Face Hub (or elsewhere) first and pass the checkpoint's file path instead. See: https://huggingface.co/distil-whisper/distil-small.en#running-whisper-in-openai-whisper

    import torch
    from datasets import load_dataset
    from huggingface_hub import hf_hub_download
    from whisper import load_model, transcribe

    # Download the checkpoint in OpenAI Whisper format from the Hugging Face Hub,
    # then load it by file path rather than by model name.
    distil_small_en = hf_hub_download(repo_id="distil-whisper/distil-small.en", filename="original-model.bin")
    model = load_model(distil_small_en)

    # Load a short validation sample and convert it to a float tensor for transcription.
    dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
    sample = dataset[0]["audio"]["array"]
    sample = torch.from_numpy(sample).float()

    pred_out = transcribe(model, audio=sample)
    print(pred_out["text"])
    
    
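    Once the checkpoint is loaded by path like this, the model should behave as a drop-in replacement, so a call in the style of your original pipeline should work (a sketch, reusing the placeholder video path and word-timestamp option from your question):

    # Assumes `model` was loaded from the downloaded distil-small.en checkpoint above.
    transcript = model.transcribe(word_timestamps=True, audio="/path/to/video")
    print(transcript["text"])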

    You can also see in the openai-whisper source that load_model only accepts a name from its known model list (or an existing file path), which is why you got the error shown above.
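
    For reference, openai-whisper exposes that list of known names, so a quick check (a minimal sketch) confirms the distilled names are not in it, matching the error message:

    import whisper

    # Prints the same list as in the error message; "distil-small.en" is not a known name,
    # so it has to be loaded from a downloaded checkpoint path instead.
    print(whisper.available_models())
    print("distil-small.en" in whisper.available_models())  # False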