Tags: python, numpy, openai-whisper, audiosegment

How to feed a numpy array as audio for whisper model


So I want to open an mp3 using AudioSegment, then convert the AudioSegment object to a numpy array and use that numpy array as input for the whisper model. I followed this question: How to create a numpy array from a pydub AudioSegment? But none of the answers there helped, since I always get an error like:

Traceback (most recent call last):
  File "E:\Programmi\PythonProjects\whisper_real_time\test\converting_test.py", line 19, in <module>
    result = audio_model.transcribe(arr_copy, language="en", word_timestamps=True,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Programmi\PythonProjects\whisper_real_time\venv\Lib\site-packages\whisper\transcribe.py", line 121, in transcribe
    mel = log_mel_spectrogram(audio, padding=N_SAMPLES)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Programmi\PythonProjects\whisper_real_time\venv\Lib\site-packages\whisper\audio.py", line 146, in log_mel_spectrogram
    audio = F.pad(audio, (0, padding))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 86261939712 bytes.

This error is strange, because if I pass the file path directly, as below, I get no problems:

result = audio_model.transcribe("../audio_test_files/1001_IEO_DIS_HI.mp3", language="en", word_timestamps=True,
                                        fp16=torch.cuda.is_available())
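
From a quick look at the whisper package, when you pass a path, transcribe loads and resamples the file itself: whisper.load_audio uses ffmpeg to return a mono float32 numpy array at 16 kHz, scaled to [-1, 1]. A minimal sketch of what the path-based call above roughly does, assuming the same model and file:

import whisper
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
audio_model = whisper.load_model("small", download_root="../models", device=device)
# load_audio decodes with ffmpeg and resamples to a 1D float32 array at 16 kHz in [-1, 1]
wav = whisper.load_audio("../audio_test_files/1001_IEO_DIS_HI.mp3")
result = audio_model.transcribe(wav, language="en", word_timestamps=True,
                                fp16=torch.cuda.is_available())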

This is the code I wrote

from pydub import AudioSegment
import numpy as np
import whisper
import torch


audio = AudioSegment.from_mp3("../audio_test_files/1001_IEO_DIS_HI.mp3")

dtype = getattr(np, "int{:d}".format(
    audio.sample_width * 8))  # Or could create a mapping: {1: np.int8, 2: np.int16, 4: np.int32, 8: np.int64}
arr = np.ndarray((int(audio.frame_count()), audio.channels), buffer=audio.raw_data, dtype=dtype)
arr_copy = arr.copy()
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Loading whisper...")
audio_model = whisper.load_model("small", download_root="../models",
                                     device=device)
print(f"Transcribing...")
result = audio_model.transcribe(audio=arr_copy, language="en", word_timestamps=True,
                                        fp16=torch.cuda.is_available())  # , initial_prompt=result.get('text', ""))
text = result['text'].strip()
print(text)
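
For debugging, it may help to print what is actually being passed in; the np.ndarray call above builds a 2D int16 array of shape (frame_count, channels), which, as far as I understand, is not the 1D float32 mono signal Whisper works with internally (a tiny check, nothing more):

print(arr_copy.shape, arr_copy.dtype)  # something like (n_frames, 2) int16 for a stereo file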

How can I do it?

--------EDIT-------- I edited the code and now use the version below. I no longer get the error from before, but the model doesn't seem to transcribe correctly. To check what audio I was passing to the model, I exported it back to a wav file; when I play it there is a lot of noise and I can't understand what is being said, which is probably why the model doesn't transcribe. Are the normalization steps I am doing correct?

from pydub import AudioSegment
import numpy as np
import whisper
import torch

language = "en"
model = "medium"
model_path = "../models"

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Loading whisper {model} model {language}...")
audio_model = whisper.load_model(model, download_root=model_path, device=device)

# load wav file with pydub
audio_path = "20230611-004146_audio_chunk.wav"
audio_segment = AudioSegment.from_wav(audio_path)
#audio_segment = audio_segment.low_pass_filter(1000)
# get sample rate
sample_rate = audio_segment.frame_rate
arr = np.array(audio_segment.get_array_of_samples())
arr_copy = arr.copy()
arr_copy = torch.from_numpy(arr_copy)
arr_copy = arr_copy.to(torch.float32)
# normalize
arr_copy = arr_copy / 32768.0
# to device
arr_copy = arr_copy.to(device)


print(f"Transcribing...")
result = audio_model.transcribe(arr_copy, language=language, fp16=torch.cuda.is_available())
text = result['text'].strip()
print(text)

waveform = arr_copy.cpu().numpy()
audio_segment = AudioSegment(
    waveform.tobytes(),
    frame_rate=sample_rate,
    sample_width=waveform.dtype.itemsize,
    channels=1
)
audio_segment.export("test.wav", format="wav")
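
One thing to note about the export above: waveform is float32, so sample_width ends up as 4 and the raw float bytes get interpreted as 32-bit integer PCM, which by itself can sound like loud noise on playback. A hedged sketch of a round trip that converts back to int16 before exporting (the output file name is just for illustration):

int16_waveform = (np.clip(waveform, -1.0, 1.0) * 32767).astype(np.int16)
check_segment = AudioSegment(
    int16_waveform.tobytes(),
    frame_rate=sample_rate,
    sample_width=2,  # int16
    channels=1
)
check_segment.export("test_int16.wav", format="wav")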

Solution

  • If I remember right, internally Whisper operates on 16 kHz mono audio in segments of 30 seconds. The conversion to the correct format, as well as the splitting and padding, is handled by the transcribe function. This is why it works correctly when you supply the MP3 path.

    If you want to supply a numpy array, you need to do the format and sample rate conversion yourself. I suggest you start by creating a short (say 10 sec) audio clip in WAV PCM format. Loading it should give you an int16 array of 160000 samples (10 sec * 16 kHz = 160000). Convert the values to float32 and normalize by dividing by 32768.0. The result should be accepted by Whisper (a small sanity check is sketched at the end of this answer).

    # audio_path, audio_model and language are assumed to be defined as in the question
    audio_segment = AudioSegment.from_mp3(audio_path)
    
    # convert to expected format
    if audio_segment.frame_rate != 16000: # 16 kHz
        audio_segment = audio_segment.set_frame_rate(16000)
    if audio_segment.sample_width != 2:   # int16
        audio_segment = audio_segment.set_sample_width(2)
    if audio_segment.channels != 1:       # mono
        audio_segment = audio_segment.set_channels(1)        
    arr = np.array(audio_segment.get_array_of_samples())
    arr = arr.astype(np.float32)/32768.0
    
    result = audio_model.transcribe(arr, language=language, fp16=torch.cuda.is_available())
    print(result['text'])
    

    If your original audio is noisy, it is hard to expect good transcription results.
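
    A quick way to sanity-check the converted array before calling transcribe (a small sketch; the 160000 figure refers to the 10-second example above):

    print(arr.dtype, arr.shape, arr.min(), arr.max())
    # expected: float32, (160000,) for a 10 s clip at 16 kHz, values within [-1.0, 1.0]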