python audio deep-learning signals signal-processing

Negative SDR result for evaluating audio source separation

I'm trying to use the eval_mus_track function of the museval package to evaluate my audio source separation model. The model was trained to predict vocals, and its output resembles the actual vocals, but evaluation metrics such as SDR come out negative.

Below is my function for generating the metrics:

import numpy as np
import museval

def estimate_and_evaluate(track):
    # track.audio is stereo, so each channel is predicted separately
    vocals_predicted_channel_1, acompaniment_predicted_channel_1, _ = model_5.predict(np.squeeze(track.audio[:, 0]))
    vocals_predicted_channel_2, acompaniment_predicted_channel_2, _ = model_5.predict(np.squeeze(track.audio[:, 1]))

    vocals = np.squeeze(np.array([vocals_predicted_channel_1.wav_file, vocals_predicted_channel_2.wav_file])).T
    accompaniment = np.squeeze(np.array([acompaniment_predicted_channel_1.wav_file, acompaniment_predicted_channel_2.wav_file])).T
    estimates = {
        'vocals': vocals,
        'accompaniment': accompaniment
    }

    scores = museval.eval_mus_track(track, estimates)
    print(scores)

The metric values I get are:

vocals          ==> SDR:  -3.776  SIR:   4.621  ISR:  -0.005  SAR: -30.538  
accompaniment   ==> SDR:  -0.590  SIR:   1.704  ISR:  -0.006  SAR: -16.613

This result doesn't make sense to me for two reasons. First, the accompaniment prediction is pure noise, because the model was trained only for vocals, yet it gets a higher SDR than the vocals. Second, the predicted vocals have a waveform very similar to the actual ones but still get a negative SDR. In the following graphs, the top one is the actual source and the bottom one is the predicted source:

Channel 1:

Channel 2:

I tried to shift the predicted vocals as mentioned here, but the result got worse.

Any idea what's causing this issue?

This is the link to the actual stereo vocals NumPy array, and this one to the predicted stereo vocals NumPy array. You can load and manipulate them using np.load. Thanks for your time.

Solution

The signal-to-distortion ratio (SDR) is actually the logarithm of a ratio, expressed in decibels. See equation (12) of this article: https://hal.inria.fr/inria-00630985/PDF/vincent_SigPro11.pdf

So an SDR of 0 dB means that the signal energy is equal to the distortion energy. An SDR below 0 dB means that there is more distortion than signal. If the audio doesn't sound like it has more distortion than signal, the cause is often a sample alignment problem.
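To make the 0 dB threshold concrete, here is a minimal sketch using a simplified SDR definition: 10·log10 of signal energy over error energy. This is an assumption made for illustration. It is not the BSS Eval computation that museval performs, so its numbers will differ from what eval_mus_track reports. The helper name simple_sdr and the synthetic noise signals are hypothetical.

```python
import numpy as np

def simple_sdr(reference, estimate):
    """Simplified SDR in dB: 10*log10(||reference||^2 / ||reference - estimate||^2).

    Illustration only; this is NOT the BSS Eval computation used by museval.
    """
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    signal_energy = np.sum(reference ** 2)
    distortion_energy = np.sum((reference - estimate) ** 2)
    return 10.0 * np.log10(signal_energy / distortion_energy)

# Synthetic stand-ins for a ground-truth source and an additive distortion.
rng = np.random.default_rng(0)
reference = rng.standard_normal(44100)
noise = rng.standard_normal(44100)
# Rescale the noise so it carries exactly the same energy as the reference.
noise *= np.linalg.norm(reference) / np.linalg.norm(noise)

sdr_equal = simple_sdr(reference, reference + noise)         # distortion energy == signal energy
sdr_worse = simple_sdr(reference, reference + 2.0 * noise)   # distortion energy is 4x the signal energy
sdr_better = simple_sdr(reference, reference + 0.1 * noise)  # distortion energy is 0.01x the signal energy

print(f"equal energy:    {sdr_equal:.2f} dB")
print(f"more distortion: {sdr_worse:.2f} dB")
print(f"less distortion: {sdr_better:.2f} dB")
```

Under this simplified definition the sign of the result only says which of the two energies is larger, which is why a visually similar waveform can still score below 0 dB.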

When you look at equation (12), you can see that the calculation depends strongly on preserving the exact sample alignment of the predicted and ground-truth audio. It can be difficult to tell from waveform plots, or even from listening, whether the samples are misaligned. A zoomed-in plot in which you can see each individual sample can help you make sure that the ground-truth and predicted samples are exactly lined up. If the prediction is shifted by even a single sample, the SDR calculation will not reflect the actual SDR.
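Besides inspecting a zoomed-in plot, one way to check alignment numerically is to look for the peak of the cross-correlation between the ground truth and the prediction. The following is a rough sketch under stated assumptions: estimate_lag and simple_sdr are hypothetical helpers, the SDR here is the simplified energy ratio rather than museval's BSS Eval metric, and the signals are single-channel white noise, which is close to a worst case for misalignment (real audio dominated by low frequencies degrades less per sample of shift).

```python
import numpy as np

def estimate_lag(reference, estimate):
    """Return how many samples `estimate` is delayed relative to `reference`.

    Uses the peak of an FFT-based cross-correlation; a negative value means
    the estimate is early. Both inputs are 1-D arrays (one channel).
    """
    n = len(reference) + len(estimate) - 1
    nfft = 1 << (n - 1).bit_length()  # next power of two, enough padding to avoid wrap-around
    ref_f = np.fft.rfft(reference, nfft)
    est_f = np.fft.rfft(estimate, nfft)
    xcorr = np.fft.irfft(est_f * np.conj(ref_f), nfft)
    peak = int(np.argmax(xcorr))
    # Indices below len(estimate) are non-negative lags; the rest wrap to negative lags.
    return peak if peak < len(estimate) else peak - nfft

def simple_sdr(reference, estimate):
    """Simplified SDR in dB (illustration only, not museval's BSS Eval)."""
    error = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(error ** 2))

rng = np.random.default_rng(0)
reference = rng.standard_normal(4096)

# Simulate a prediction that is otherwise perfect but delayed by 3 samples.
delay = 3
delayed = np.concatenate([np.zeros(delay), reference[:-delay]])

lag = estimate_lag(reference, delayed)
sdr_misaligned = simple_sdr(reference, delayed)

# Undo the detected delay (zero-pad the tail) and score again.
realigned = np.concatenate([delayed[lag:], np.zeros(lag)])
sdr_realigned = simple_sdr(reference, realigned)

print("detected lag:", lag)
print(f"SDR before realignment: {sdr_misaligned:.2f} dB")
print(f"SDR after realignment:  {sdr_realigned:.2f} dB")
```

For the stereo arrays in the question, the same check could be run per channel, for example on actual[:, 0] and predicted[:, 0], where those array names are placeholders for whatever np.load returns.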