
ParameterError: Mono data must have shape (samples,). Received shape=(1, 87488721)


I am working on speaker diarization in Python and using pyannote to compute embeddings. My embedding function looks like this:

import numpy as np
import torch
import librosa
from pyannote.core import Segment

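def embeddings_(audio_path, resegmented, duration):
  # Load the pretrained pyannote speaker-embedding model
  model_emb = torch.hub.load('pyannote/pyannote-audio', 'emb')

  # Sanity check: sliding-window embeddings over the whole file
  embedding = model_emb({'audio': audio_path})
  for window, emb in embedding:
    assert isinstance(window, Segment)
    assert isinstance(emb, np.ndarray)

  # Describe the file: its path and its duration in seconds
  y, sr = librosa.load(audio_path)
  myDict = {}
  myDict['audio'] = audio_path
  myDict['duration'] = len(y) / sr

  # One embedding per segment, covering `duration` seconds from the segment start
  data = []
  for i in resegmented:
    excerpt = Segment(start=i[0], end=i[0] + duration)
    emb = model_emb.crop(myDict, excerpt)
    data.append(emb.T)
  data = np.asarray(data)

  return data.reshape(len(data), 512)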

When I run

embeddings = embeddings_(audiofile,resegmented,2)

I get this error:

ParameterError: Mono data must have shape (samples,). Received shape=(1, 87488721)

Solution

  • I got the same error and found a workaround. For me, the error was triggered in "pyannote/audio/features/utils.py", where the audio is resampled with this line: y = librosa.core.resample(y.T, sample_rate, self.sample_rate).T

    This is my workaround:

        def get_features(self, y, sample_rate):
    
            # convert to mono
            if self.mono:
                y = np.mean(y, axis=1, keepdims=True)
                y = np.squeeze(y)    # Add this line
            
            # resample if sample rates mismatch
            if (self.sample_rate is not None) and (self.sample_rate != sample_rate):
                y = librosa.core.resample(y.T, sample_rate, self.sample_rate).T
                sample_rate = self.sample_rate
    
            # augment data
            if self.augmentation is not None:
                y = self.augmentation(y, sample_rate)
    
            # restore the (samples, 1) shape expected below
            if len(y.shape) == 1:     # Add this line
                y = y[:,np.newaxis]   # Add this line

            # TODO: how time consuming is this thing (needs profiling...)
            try:
                valid = valid_audio(y[:, 0], mono=True)
            except ParameterError as e:
                msg = "Something went wrong when augmenting waveform."
                raise ValueError(msg)
    
            return y
    

    The idea is to apply np.squeeze to y so that librosa.core.resample receives a 1-D array of shape (samples,), and then use y[:, np.newaxis] to restore the shape (samples, 1) that valid_audio(y[:, 0], mono=True) indexes into.
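
The shape round trip in this workaround can be tried on its own, outside pyannote. The following is only a minimal sketch: the helper names to_mono_1d and to_column are hypothetical (they are not part of pyannote or librosa), the input is random data rather than real audio, and the resampling call itself is left out so that only NumPy is needed.

```python
import numpy as np


def to_mono_1d(y):
    """Average a (samples, channels) waveform down to a 1-D mono signal.

    Hypothetical helper: mirrors the np.mean + np.squeeze step of the workaround.
    """
    y = np.asarray(y)
    if y.ndim == 2:
        y = np.mean(y, axis=1, keepdims=True)  # shape (samples, 1)
        y = np.squeeze(y, axis=1)              # shape (samples,)
    return y


def to_column(y):
    """Restore the (samples, 1) layout after the 1-D processing step.

    Hypothetical helper: mirrors the y[:, np.newaxis] step of the workaround.
    """
    y = np.asarray(y)
    if y.ndim == 1:
        y = y[:, np.newaxis]
    return y


# Fake stereo signal: 16000 samples, 2 channels (random data, not real audio)
stereo = np.random.default_rng(0).standard_normal((16000, 2))

mono = to_mono_1d(stereo)   # 1-D array, the layout a mono resampler expects
column = to_column(mono)    # 2-D array again, so column[:, 0] is valid

print(mono.shape, column.shape)  # (16000,) (16000, 1)
```

Passing axis=1 to np.squeeze is a small deviation from the answer: it removes only the channel axis, so a one-sample signal is not collapsed to a scalar.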