I am working on a project on emotion detection from speech (voice tone). As features I am using MFCCs, which I understand to some extent and know to be very important features when it comes to speech.
This is the code I am using, based on librosa, to extract features from my audio files, which I then use to train a neural network:
import librosa
import numpy as np
dat, sample_rate = librosa.load(audio_path, res_type='kaiser_fast')
mfccs = np.mean(librosa.feature.mfcc(y=dat, sr=sample_rate, n_mfcc=13).T, axis=0)
What I want to know is how averaging the Mel-frequency cepstral coefficients over time (the mean over axis 0 of the transposed matrix) affects performance. Am I losing valuable information from my audio file? Or should I use the entire MFCC matrix for training and apply some padding technique so that the feature size stays the same across all training audio files, since they have different lengths?
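To make the two options concrete, here is a minimal sketch using only NumPy. The random arrays stand in for the output of librosa.feature.mfcc, which has shape (n_mfcc, n_frames). The helper name pad_or_truncate and the target length of 200 frames are assumptions chosen for illustration, not something prescribed by librosa.

```python
import numpy as np

def pad_or_truncate(mfcc, max_frames):
    """Return an (n_mfcc, max_frames) array: zero-pad short clips, cut long ones."""
    n_mfcc, n_frames = mfcc.shape
    if n_frames >= max_frames:
        return mfcc[:, :max_frames]
    padded = np.zeros((n_mfcc, max_frames), dtype=mfcc.dtype)
    padded[:, :n_frames] = mfcc
    return padded

# Stand-ins for MFCC matrices of two clips with different durations.
rng = np.random.default_rng(0)
short_clip = rng.normal(size=(13, 80))
long_clip = rng.normal(size=(13, 250))

# Option 1: average over time. Same result as np.mean(mfcc.T, axis=0).
averaged = short_clip.mean(axis=1)               # shape (13,), time axis is gone

# Option 2: keep the time axis and force a common length.
fixed_short = pad_or_truncate(short_clip, 200)   # shape (13, 200)
fixed_long = pad_or_truncate(long_clip, 200)     # shape (13, 200)
```

The averaged vector has the same size for every clip but no longer says when anything happened; the padded matrices keep the frame order at the cost of a larger input.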
I also looked at other techniques, e.g. taking the derivatives of the MFCCs and concatenating them with the original coefficients, but I am still not sure which technique gives a better feature set and, eventually, better classification results.
If these two techniques are not that useful, then maybe I should stick with my current approach as shown in the code, i.e. take the average, and maybe increase the number of Mel-frequency coefficients from 13 to a higher number.
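As an illustration of the derivative idea, the sketch below stacks the MFCCs with their first and second time differences. It uses np.gradient as a simple finite-difference approximation; librosa provides librosa.feature.delta, which computes a smoothed estimate instead, so treat the numbers here as an assumption-laden stand-in rather than the same result. The random matrix again replaces real MFCC output.

```python
import numpy as np

rng = np.random.default_rng(0)
mfcc = rng.normal(size=(13, 120))      # stand-in for (n_mfcc, n_frames)

# Differences along the time axis (axis=1), one value per frame.
delta = np.gradient(mfcc, axis=1)      # first-order change over time
delta2 = np.gradient(delta, axis=1)    # second-order change over time

# Stack static and dynamic coefficients frame by frame.
features = np.concatenate([mfcc, delta, delta2], axis=0)   # shape (39, 120)
```

Note that the deltas only describe local change from frame to frame; if the stacked matrix is averaged over time afterwards, the ordering of frames is still discarded.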
I think averaging is a bad idea in this case. Yes, you lose valuable temporal information. But in the context of emotion recognition the more important problem is that averaging dilutes the informative parts of the signal with the background. It is well known that emotions are subtle phenomena that may surface only for a short period of time and stay hidden the rest of the time.
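A toy calculation shows the dilution effect. The "cue strength" track below is purely hypothetical (it is not an MFCC and not derived from real audio): a cue that is present in 10 of 200 frames.

```python
import numpy as np

n_frames = 200
track = np.zeros(n_frames)   # hypothetical per-frame cue strength
track[90:100] = 1.0          # the cue is present in only 10 frames

mean_value = track.mean()    # 10 / 200 = 0.05, the cue almost vanishes
max_value = track.max()      # 1.0, the cue is still fully visible
```

After averaging, a clip with a short but clear cue is hard to tell apart from a clip with no cue at all, whereas anything that looks at individual frames can still find it.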
Since your goal is to prepare the audio signal for processing with an ML method, I should say that there are plenty of ways to do this properly. In short, you process each MFCC frame independently (for example with a DNN) and then somehow represent the entire sequence. See this answer for more details and links: How to classify continuous audio
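One possible reading of "process each frame, then represent the sequence" is sketched below. The frame-level model is a hypothetical, untrained linear layer with a softmax (standing in for a DNN), and the clip-level decision averages the per-frame log-probabilities. Both choices, as well as the names classify_clip, weights and bias and the four classes, are assumptions for illustration; other aggregation schemes are equally valid.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def classify_clip(mfcc, weights, bias):
    """Score every frame separately, then pool the scores over the clip."""
    frames = mfcc.T                                   # (n_frames, n_mfcc)
    frame_probs = softmax(frames @ weights + bias)    # (n_frames, n_classes)
    clip_log_probs = np.log(frame_probs).mean(axis=0) # pool over time
    return int(np.argmax(clip_log_probs)), frame_probs

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.1, size=(13, 4))   # untrained, 4 hypothetical classes
bias = np.zeros(4)

# Clips of different lengths need no padding with this scheme.
label_a, probs_a = classify_clip(rng.normal(size=(13, 80)), weights, bias)
label_b, probs_b = classify_clip(rng.normal(size=(13, 250)), weights, bias)
```

Because pooling happens after the per-frame decision rather than on the raw features, a few strongly scored frames can still influence the result.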
To put a static DNN into a dynamic context, combining DNNs with hidden Markov models used to be quite popular. The classical paper describing the approach dates back to 2013: https://www.researchgate.net/publication/261500879_Hybrid_Deep_Neural_Network_-_Hidden_Markov_Model_DNN-HMM_based_speech_emotion_recognition
Newer methods have been developed since then, for example: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/IS140441.pdf
Given enough data (and skill) for training, you can employ some kind of recurrent neural network, which solves the sequence classification task by design.
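To show what "by design" means, here is a minimal forward pass of a plain recurrent network in NumPy. The weights are random and untrained, the layer sizes are arbitrary, and the names rnn_classify and init_params are made up for this sketch; in practice you would train an LSTM or GRU in a deep learning framework rather than use this code.

```python
import numpy as np

def init_params(n_mfcc, n_hidden, n_classes, seed=0):
    """Random, untrained weights for a single-layer tanh RNN."""
    rng = np.random.default_rng(seed)
    return (rng.normal(scale=0.1, size=(n_mfcc, n_hidden)),    # input to hidden
            rng.normal(scale=0.1, size=(n_hidden, n_hidden)),  # hidden to hidden
            np.zeros(n_hidden),
            rng.normal(scale=0.1, size=(n_hidden, n_classes)), # hidden to output
            np.zeros(n_classes))

def rnn_classify(mfcc, params):
    """Read the frames in order and return class scores from the final state."""
    W_in, W_rec, b_h, W_out, b_out = params
    h = np.zeros(W_rec.shape[0])
    for frame in mfcc.T:                        # one step per MFCC frame
        h = np.tanh(frame @ W_in + h @ W_rec + b_h)
    return h @ W_out + b_out

params = init_params(n_mfcc=13, n_hidden=16, n_classes=4)
rng = np.random.default_rng(1)
clip = rng.normal(size=(13, 80))

scores = rnn_classify(clip, params)                  # shape (4,)
scores_reversed = rnn_classify(clip[:, ::-1], params)
scores_long = rnn_classify(rng.normal(size=(13, 250)), params)
```

The same parameters handle clips of any length, and reversing the frame order changes the scores, so the temporal structure that averaging throws away is part of the input.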