I am converting multiple log-mel spectrograms from .wav files to images.
I want to destroy as little information as possible as I plan to use the resulting images for a computer vision task.
To convert the data to an image format, I currently use a simple sklearn.MinMaxScaler((0, 255)).
To fit this scaler, I use the minimum and maximum energy over all frequencies across all my spectrograms.
Should I scale my spectrograms with minimal and maximal energy for each specific frequency?
Does it make sense to have different frequencies with different scaling features?
Spectrograms are tricky to use as input to computer vision algorithms, especially neural networks, because their values have a skewed, non-normal distribution. To tackle this you should normalize the data and then scale it to a fixed range. For neural networks, this could be sklearn.MinMaxScaler((0, 1)). For classic computer vision, this could be sklearn.MinMaxScaler((0, 255)).
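As a rough illustration of the scaling step, here is a minimal NumPy sketch that maps a log-mel spectrogram to [0, 1] and then to an 8-bit image. The helper names `to_unit_range` and `to_uint8_image` and the synthetic input are assumptions made for the example, not part of sklearn or of the original setup.

```python
import numpy as np

def to_unit_range(log_mel, lo=None, hi=None):
    """Min-max scale a log-mel spectrogram to [0, 1].

    lo and hi default to the array's own min and max. Pass dataset-wide
    values instead to keep the scaling identical across spectrograms.
    """
    lo = log_mel.min() if lo is None else lo
    hi = log_mel.max() if hi is None else hi
    scaled = (log_mel - lo) / max(hi - lo, 1e-12)  # guard against a flat input
    return np.clip(scaled, 0.0, 1.0)

def to_uint8_image(unit):
    """Quantize a [0, 1] array to 8-bit pixel values in [0, 255]."""
    return np.round(unit * 255).astype(np.uint8)

# Synthetic stand-in for a real log-mel spectrogram, shape (n_mels, n_frames).
rng = np.random.default_rng(0)
log_mel = rng.normal(loc=-40.0, scale=10.0, size=(64, 100))

unit = to_unit_range(log_mel)   # float values in [0, 1], e.g. for a neural network
image = to_uint8_image(unit)    # uint8 values in [0, 255], e.g. for saving as an image
```

Rounding to 8 bits keeps only 256 distinct levels, so if the goal is to lose as little information as possible, feeding the float array to the model directly is worth considering.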
So,
Should I scale my spectrograms with minimal and maximal energy for each specific frequency?
Yes, once the normalization is done.
and
Does it make sense to have different frequencies with different scaling features?
It depends. For CNNs, your input data needs to be consistent for good results. For classic computer vision approaches, it could make sense, depending on what you want to do with the images.
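To make the two options concrete, the sketch below compares one global min/max with one min/max per frequency band, both computed over the whole dataset so that every spectrogram is scaled the same way. The dataset here is synthetic and the variable names are assumptions for illustration. Note that sklearn's MinMaxScaler scales each column (feature) independently, so fitting it on arrays shaped (frames, mel bands) would already give per-band scaling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in dataset: 5 spectrograms, 64 mel bands, 100 frames each.
# Higher bands get a lower mean level so the bands differ in range.
band_offset = np.linspace(0.0, -30.0, 64)[None, :, None]
specs = rng.normal(-30.0, 5.0, size=(5, 64, 100)) + band_offset

# Option 1: a single min and max over all bands and all spectrograms.
g_min, g_max = specs.min(), specs.max()
global_scaled = (specs - g_min) / (g_max - g_min)

# Option 2: one min and max per mel band, taken over all spectrograms
# and all frames (axes 0 and 2), then broadcast back over the data.
f_min = specs.min(axis=(0, 2), keepdims=True)
f_max = specs.max(axis=(0, 2), keepdims=True)
per_band_scaled = (specs - f_min) / (f_max - f_min)
```

With global scaling, the relative level between bands is preserved. With per-band scaling, every band spans the full [0, 1] range, which discards that relative level but uses the available range in each band.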