python numpy normalize mfcc

Standardize a 3D NumPy array that has been padded with np.nan


I have a 3D matrix with a shape like (100, 40, 170).

This matrix has been padded to reach the max length of 170 by filling up with np.nan (NaN).

The values in the matrix represent MFCC coefficients from audio data extracted from the UrbanSound8K dataset, using LibRosa (Python).

(The source notebook and data are shared at the end of the post.)

I need to normalize this matrix along axis=2 by:

  1. Computing the mean over the 3rd axis, ignoring elements equal to np.nan
  2. Computing the std dev over the 3rd axis, ignoring elements equal to np.nan
  3. Subtracting the mean from every element that is not equal to np.nan
  4. Dividing every element that is not equal to np.nan by the std dev

I have tried many different approaches and none of them worked. Other posts point to sklearn, but the normalization tools in that library do not handle 3D matrices well, so for now this is my best approach:

# Compute mean and std dev matrices (omitting NaN and keeping shapes)
mean = np.nanmean(X_nan, axis=2, keepdims=True)
std = np.nanstd(X_nan, axis=2, keepdims=True)

But then, when I subtract and divide, I get a warning and wrong results:

X_norm -= mean
X_norm /= std

The warning message says:

RuntimeWarning: divide by zero encountered in true_divide

And when I check just the first elements of the normalized and original matrices, I see:

# Original
array([[[-58.95327, -58.95327,        -58.95327,       ...,          
                     nan,             nan,            nan],

# Normalized
array([[[-inf,       -inf,            -inf,            ...,
                     inf,             inf,             inf],

Note that the -inf values were introduced when subtracting the mean, not by the division.

Can you recommend a way to compute both metrics and do the subtraction and division with NumPy while omitting the padded values?

Thank you very much!

The data was generated with this notebook (note that the repo is under development!): Urban sound classification with CNN

I have uploaded the data (pickled X and y): MFCC Coeffs X and Y


Solution

  • Please try this solution:

    X_norm = np.where(np.isnan(X_nan), np.nan, X_nan - mean)
    X_norm = np.where(X_norm == 0, 0, X_norm/std)
    

    This still emits a warning, because np.where evaluates X_norm/std for every element before selecting, but the result appears to be correct.

    std can be 0 only when all non-NaN elements along the axis are equal. In that case the mean equals those elements, so the subtraction yields all zeros. The second np.where handles this situation by setting those entries to 0 instead of dividing 0 by 0.
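
If the warning is a concern, one possible variation is to let np.divide skip the zero-std slices through its `where` and `out` arguments. This is only a sketch, not the method from the answer above: the function name `standardize_padded` and the tiny array `X` are made up for illustration, and it assumes no slice along the axis consists entirely of NaN (np.nanmean warns about empty slices in that case).

```python
import numpy as np

def standardize_padded(X_nan, axis=2):
    """Standardize along `axis`, leaving NaN padding as NaN (hypothetical helper)."""
    mean = np.nanmean(X_nan, axis=axis, keepdims=True)
    std = np.nanstd(X_nan, axis=axis, keepdims=True)
    centered = X_nan - mean  # NaN padding stays NaN after the subtraction
    # Divide only where std is non-zero. Elsewhere the output keeps the
    # centered values: 0 for real data, NaN for padding.
    return np.divide(centered, std, out=centered.copy(), where=std != 0)

# Toy data with shape (1, 2, 4): the second row is constant, so its std is 0.
X = np.array([[[1.0, 2.0, 3.0, np.nan],
               [5.0, 5.0, np.nan, np.nan]]])
Z = standardize_padded(X)
```

Here the first row is scaled to zero mean and unit std over its three real values, the constant row becomes zeros, and every padded position is still NaN.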