I can successfully use the Whisper CLI to transcribe a WAV audio file. I use the command:
whisper --language en --model tiny --device cpu .tmp/audio/chunk1.wav
The binary is located here, and I am using Python 3.11:
dev@host ~/Development $ whereis whisper
whisper: /home/dev/Development/whispervm/.direnv/python-3.11/bin/whisper
Then I create a script that in theory should do the exact same thing, but it recognizes my NVIDIA card, attempts to use CUDA, and fails even though I explicitly request the "cpu" device.
#!/usr/bin/env python
import whisper
# Whisper has multiple models you can load, depending on size and requirements
model = whisper.load_model("tiny").to("cpu")
# path to the audio file you want to transcribe
PATH = ".tmp/audio/chunk1.wav"
result = model.transcribe(PATH, fp16=False)
print(result["text"])
The output is:
Found GPU0 Quadro K4000 which is of cuda capability 3.0.
PyTorch no longer supports this GPU because it is too old.
The minimum cuda capability supported by this library is 3.7.
warnings.warn(old_gpu_warn % (d, name, major, minor, min_arch // 10, min_arch % 10))
Traceback (most recent call last):
File "/home/dev/Development/whisper/test.py", line 2, in <module>
model = whisper.load_model("tiny").to("cpu")
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/dev/Development/whispervm/.direnv/python-3.11/lib/python3.11/site-packages/whisper/__init__.py", line 149, in load_model
model.load_state_dict(checkpoint["model_state_dict"])
File "/home/dev/Development/whispervm/.direnv/python-3.11/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Whisper:
While copying the parameter named "encoder.blocks.0.attn.query.weight", whose dimensions in the model are torch.Size([384, 384]) and whose dimensions in the checkpoint are torch.Size([384, 384]), an exception occurred : ('CUDA error: no kernel image is available for execution on the device\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n',).
While copying the parameter named "encoder.blocks.0.attn.key.weight", whose dimensions in the model are torch.Size([384, 384]) and whose dimensions in the checkpoint are torch.Size([384, 384]), an exception occurred : ('CUDA error: no kernel image is available for execution on the device\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n',).
and many more parameter errors follow. It does not transcribe. I'm thinking this might be a bug.
tl;dr: Whisper will not transcribe on the CPU from a Python script, even though it does from the CLI.
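For reference, a quick diagnostic (hypothetical; not part of my original script) confirms that PyTorch still enumerates the old GPU, which is presumably why Whisper tries CUDA:

import torch

print(torch.__version__)              # 2.0.1 in this environment
print(torch.cuda.is_available())      # True, even though the Quadro K4000 is unsupported
print(torch.cuda.get_device_name(0))  # the card PyTorch detected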
Edit: installed pip packages list
Package Version
------------------------ ----------
bcrypt 4.0.1
certifi 2023.7.22
cffi 1.16.0
charset-normalizer 3.3.0
cmake 3.27.6
cryptography 41.0.4
decorator 5.1.1
Deprecated 1.2.14
fabric 3.2.2
filelock 3.12.4
idna 3.4
invoke 2.2.0
Jinja2 3.1.2
lit 17.0.2
llvmlite 0.41.0
MarkupSafe 2.1.3
more-itertools 10.1.0
mpmath 1.3.0
networkx 3.1
numba 0.58.0
numpy 1.25.2
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.2.10.91
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusparse-cu11 11.7.4.91
nvidia-nccl-cu11 2.14.3
nvidia-nvtx-cu11 11.7.91
openai-whisper 20230918
paramiko 3.3.1
pip 23.2.1
pycparser 2.21
pydub 0.25.1
PyNaCl 1.5.0
regex 2023.10.3
requests 2.31.0
setuptools 68.1.2
sympy 1.12
tiktoken 0.3.3
torch 2.0.1
tqdm 4.66.1
triton 2.0.0
typing_extensions 4.8.0
urllib3 2.0.6
wheel 0.41.2
wrapt 1.15.0
Found the error.
I looked at the source code, and it turns out the device has to be passed to the load_model() call itself, contrary to what I had been reading on blogs.
So the correct script looks like this:
#!/usr/bin/env python
import whisper

audio_file = "/home/dev/Development/whispervm/.tmp/audio/chunk1.wav"
audio = whisper.load_audio(audio_file)

# Pass the device here; otherwise load_model() picks CUDA when a GPU is visible
model = whisper.load_model("tiny", device="cpu")

result = model.transcribe(audio, fp16=False)  # fp16=False suppresses the FP16-on-CPU warning
print(result["text"])
I read that if you don't specify the device, it's supposed to default to the CPU. In fact, load_model() defaults to CUDA whenever a CUDA device is detected, and when your card is too old for the more recent PyTorch versions, loading fails.
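For reference, this is roughly the device-selection logic inside whisper.load_model(), paraphrased from the openai-whisper source (the exact code may differ between versions); the helper name default_device is mine:

import torch

def default_device(device=None):
    # When no device is given, CUDA wins as soon as torch.cuda.is_available()
    # returns True -- even for a GPU too old for the kernels shipped with
    # recent PyTorch wheels, which is what produced the "no kernel image
    # is available" errors above.
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    return device

print(default_device())       # "cuda" on this machine, hence the crash
print(default_device("cpu"))  # an explicit device wins, hence the fix

An alternative workaround should be to hide the GPU from PyTorch entirely, e.g. running CUDA_VISIBLE_DEVICES="" python test.py, so that torch.cuda.is_available() returns False and the default falls back to the CPU.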