I want to open an MP3 with pydub's AudioSegment, convert the AudioSegment object to a numpy array, and use that array as input for a Whisper model. I followed this question, How to create a numpy array from a pydub AudioSegment?, but none of the answers helped, because I always get an error like this:
Traceback (most recent call last):
File "E:\Programmi\PythonProjects\whisper_real_time\test\converting_test.py", line 19, in <module>
result = audio_model.transcribe(arr_copy, language="en", word_timestamps=True,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "E:\Programmi\PythonProjects\whisper_real_time\venv\Lib\site-packages\whisper\transcribe.py", line 121, in transcribe
mel = log_mel_spectrogram(audio, padding=N_SAMPLES)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "E:\Programmi\PythonProjects\whisper_real_time\venv\Lib\site-packages\whisper\audio.py", line 146, in log_mel_spectrogram
audio = F.pad(audio, (0, padding))
^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 86261939712 bytes.
This error is strange, because if I pass the file path directly, as below, there is no problem:
result = audio_model.transcribe("../audio_test_files/1001_IEO_DIS_HI.mp3", language="en", word_timestamps=True,
fp16=torch.cuda.is_available())
This is the code I wrote
from pydub import AudioSegment
import numpy as np
import whisper
import torch
audio = AudioSegment.from_mp3("../audio_test_files/1001_IEO_DIS_HI.mp3")
dtype = getattr(np, "int{:d}".format(
audio.sample_width * 8)) # Or could create a mapping: {1: np.int8, 2: np.int16, 4: np.int32, 8: np.int64}
arr = np.ndarray((int(audio.frame_count()), audio.channels), buffer=audio.raw_data, dtype=dtype)
arr_copy = arr.copy()
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Loading whisper...")
audio_model = whisper.load_model("small", download_root="../models",
device=device)
print(f"Transcribing...")
result = audio_model.transcribe(audio=arr_copy, language="en", word_timestamps=True,
fp16=torch.cuda.is_available()) # , initial_prompt=result.get('text', ""))
text = result['text'].strip()
print(text)
How can I do it?
--------EDIT-------- I edited the code and now use the version below. The earlier error is gone, but the model does not seem to transcribe correctly. To check what audio I was passing to the model, I exported it back to a WAV file; when I play it there is a lot of noise and I can't understand what is being said, which is presumably why the model fails to transcribe. Is the normalization I am doing correct? (A sketch of a corrected export follows the code below.)
from pydub import AudioSegment
import numpy as np
import whisper
import torch
language = "en"
model = "medium"
model_path = "../models"
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Loading whisper {model} model {language}...")
audio_model = whisper.load_model(model, download_root=model_path, device=device)
# load wav file with pydub
audio_path = "20230611-004146_audio_chunk.wav"
audio_segment = AudioSegment.from_wav(audio_path)
#audio_segment = audio_segment.low_pass_filter(1000)
# get sample rate
sample_rate = audio_segment.frame_rate
arr = np.array(audio_segment.get_array_of_samples())
arr_copy = arr.copy()
arr_copy = torch.from_numpy(arr_copy)
arr_copy = arr_copy.to(torch.float32)
# normalize
arr_copy = arr_copy / 32768.0
# to device
arr_copy = arr_copy.to(device)
print(f"Transcribing...")
result = audio_model.transcribe(arr_copy, language=language, fp16=torch.cuda.is_available())
text = result['text'].strip()
print(text)
waveform = arr_copy.cpu().numpy()
audio_segment = AudioSegment(
waveform.tobytes(),
frame_rate=sample_rate,
sample_width=waveform.dtype.itemsize,
channels=1
)
audio_segment.export("test.wav", format="wav")
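A likely reason the exported check file sounds like noise: waveform holds normalized float32 samples, but the AudioSegment constructor above is given sample_width=4, so the raw bytes are interpreted as 32-bit integer PCM. Below is a minimal sketch of converting back to int16 before exporting, assuming mono 16-bit source audio as in the code above (the output file name is just an example):
import numpy as np
from pydub import AudioSegment
# waveform and sample_rate come from the code above
int16_waveform = (waveform * 32768.0).clip(-32768, 32767).astype(np.int16)
check_segment = AudioSegment(
    int16_waveform.tobytes(),
    frame_rate=sample_rate,  # the rate the samples actually have
    sample_width=2,          # int16 -> 2 bytes per sample
    channels=1
)
check_segment.export("test_int16.wav", format="wav")
This only affects the round-trip listening check; it does not change what is passed to Whisper.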
If I remember right, Whisper internally operates on 16 kHz mono audio in 30-second segments. The conversion to the correct format, as well as the splitting and padding, is handled by the transcribe function. This is why it works correctly when you supply the MP3 path.
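For context, here is a rough sketch of what happens when you pass a path, using the library's own helpers (simplified; transcribe actually slides over successive 30-second windows, and the file name is just a placeholder):
import whisper
audio = whisper.load_audio("clip.mp3")    # ffmpeg decode -> float32 mono at 16 kHz, values in [-1, 1]
audio = whisper.pad_or_trim(audio)        # pad or trim to one 30-second window
mel = whisper.log_mel_spectrogram(audio)  # the log-Mel features the model consumes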
If you want to supply a numpy array, you need to do the format and sample rate conversion yourself. I suggest you start by creating a short (say 10 second) audio clip in WAV PCM format. Loading it should give you an int16 array of 160000 samples (10 s * 16 kHz = 160000). Convert the values to float32 and normalize by dividing by 32768.0. The result should be accepted by Whisper. With pydub, something like this:
audio_segment = AudioSegment.from_mp3(audio_path)
# convert to expected format
if audio_segment.frame_rate != 16000: # 16 kHz
audio_segment = audio_segment.set_frame_rate(16000)
if audio_segment.sample_width != 2: # int16
audio_segment = audio_segment.set_sample_width(2)
if audio_segment.channels != 1: # mono
audio_segment = audio_segment.set_channels(1)
arr = np.array(audio_segment.get_array_of_samples())
arr = arr.astype(np.float32)/32768.0
result = audio_model.transcribe(arr, language=language, fp16=torch.cuda.is_available())
print(result['text'])
If your original audio is noisy, it is hard to expect good transcription results.