Converting to a log-spectrogram results in a length-0 mel spectrogram

I have a long video from which I want to isolate one part and extract the log-spectrogram of that part. I am using moviepy to load the original file, which is in mp4 format. I then use subclip to extract the relevant part, and .audio to refer only to the audio of the file.
There is an option to extract the audio data and the sampling rate, just as when loading with librosa. The full code for extracting the audio data and sampling rate is the following:

with VideoFileClip(input_file) as video:
    # Use the first three seconds of the video
    clip = video.subclip(0, 3)
    # Get the audio data and sample rate
    y =
    sr =
    l =

print(f'y:{y.shape}, sr:{sr}, length:{l}')

This prints:

>>>  y:(132300, 2), sr:44100, length:3

Next, I want to convert the above data into a spectrogram. When I try the following, my machine crashes or I get an error.

with VideoFileClip(input_file) as video:
    # Trim video
    clip = video.subclip(start_time_sec, end_time_sec)
    # Get length of the trimmed video
    length = end_time_sec - start_time_sec

    # Get the audio data and sample rate
    y =
    sr =
    l =

    # Do something with the audio data
    spectrogram = librosa.feature.melspectrogram(y=y, n_fft=2048, hop_length=512)
    librosa.display.specshow(spectrogram, sr=sr)


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[33], line 16
     14 # Do something with the audio data
     15 spectrogram = librosa.feature.melspectrogram(y=y, n_fft=2048, hop_length=512)
---> 16 librosa.display.specshow(spectrogram, sr=sr)

File /Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/librosa/, in specshow(data, x_coords, y_coords, x_axis, y_axis, sr, hop_length, n_fft, win_length, fmin, fmax, tuning, bins_per_octave, key, Sa, mela, thaat, auto_aspect, htk, unicode, intervals, unison, ax, **kwargs)
   1211 x_coords = __mesh_coords(x_axis, x_coords, data.shape[1], **all_params)
   1213 axes = __check_axes(ax)
-> 1215 out = axes.pcolormesh(x_coords, y_coords, data, **kwargs)
   1217 __set_current_image(ax, out)
   1219 # Set up axis scaling

File /Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/matplotlib/, in _preprocess_data..inner(ax, data, *args, **kwargs)
   1439 @functools.wraps(func)
   1440 def inner(ax, *args, data=None, **kwargs):
   1441     if data is None:
-> 1442         return func(ax, *map(sanitize_sequence, args), **kwargs)
   1444     bound = new_sig.bind(ax, *args, **kwargs)
   1445     auto_label = (bound.arguments.get(label_namer)
   1446                   or bound.kwargs.get(label_namer))

File /Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/matplotlib/axes/, in Axes.pcolormesh(self, alpha, norm, cmap, vmin, vmax, shading, antialiased, *args, **kwargs)
   6225     C = C.ravel()
   1984             f"shading, A should have shape "
   1985             f"{' or '.join(map(str, ok_shapes))}, not {A.shape}")
   1986 return super().set_array(A)

ValueError: For X (129) and Y (132301) with flat shading, A should have shape (132300, 128, 3) or (132300, 128, 4) or (132300, 128) or (16934400,), not (132300, 128, 1)

Finally, when I use power_to_db and plt.imshow as in the following code

ps = librosa.feature.melspectrogram(y=y, sr=sr)
ps_db= librosa.power_to_db(ps)
# librosa.display.specshow(ps_db, x_axis='s', y_axis='log')
plt.imshow(ps_db, origin="lower", cmap=plt.get_cmap("magma"))

I get the following undesired result: [image]

Is the problem related to the overlap size, or to something else?


  • The librosa multi-channel format is channels-first, whereas your audio seems to be channels-last. Try y =, to convert it.

    There might also be problems with passing a stereo mel-spectrogram to librosa.display.specshow. If it is acceptable to work in mono, then convert the audio using y = librosa.to_mono(y) before doing the processing.
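
The shape in the error message, (132300, 128, 1), is consistent with this explanation: with a (132300, 2) input, librosa appears to treat each of the 132300 rows as its own channel holding only 2 samples, which gives a single frame per "channel". Below is a minimal sketch of the conversion, assuming the array really is channels-last as printed in the question. The helper names (to_mono_channels_first, expected_frames, log_mel) are made up for illustration, and the random noise is only a stand-in for the real clip audio.

```python
import numpy as np


def to_mono_channels_first(y):
    """Convert (n_samples, n_channels) audio to a 1-D mono signal.

    Assumes y is channels-last, like the (132300, 2) array in the question.
    """
    y = np.asarray(y, dtype=np.float32)
    if y.ndim == 1:
        return y  # already mono
    y_cf = y.T                # (n_channels, n_samples), the layout librosa expects
    return y_cf.mean(axis=0)  # average the channels, as librosa.to_mono does


def expected_frames(n_samples, hop_length=512):
    """Number of frames of a centered STFT (librosa's default center=True)."""
    return 1 + n_samples // hop_length


def log_mel(y_mono, sr, n_fft=2048, hop_length=512):
    """Log-scaled mel spectrogram of a mono signal, shape (n_mels, n_frames)."""
    import librosa  # imported here so the helpers above only need NumPy

    S = librosa.feature.melspectrogram(
        y=y_mono, sr=sr, n_fft=n_fft, hop_length=hop_length
    )
    return librosa.power_to_db(S, ref=np.max)


# Synthetic stand-in for the question's data: 3 s of stereo noise at 44.1 kHz.
sr = 44100
rng = np.random.default_rng(0)
y = rng.standard_normal((3 * sr, 2)).astype(np.float32)

y_mono = to_mono_channels_first(y)
print(y_mono.shape)                      # (132300,)
print(expected_frames(y_mono.shape[0]))  # 259
print(expected_frames(2))                # 1, the length-1 axis in the error
```

Calling log_mel(y_mono, sr) should then return a 2-D array of shape (128, 259), which librosa.display.specshow can draw, for example with sr=sr, hop_length=512, x_axis='time', y_axis='mel'. Note also that the first melspectrogram call in the question does not pass sr, so librosa falls back to its default of 22050 Hz rather than the clip's 44100 Hz.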