Search code examples
audiomachine-learningsound-recognition

What algorithm is used for audio feature extraction in google's audioset?


I am getting started with Google's Audioset. While the dataset is extensive, I find the information with regards to the audio feature extraction very vague. The website mentions

128-dimensional audio features extracted at 1Hz. The audio features were extracted using a VGG-inspired acoustic model described in Hershey et. al., trained on a preliminary version of YouTube-8M. The features were PCA-ed and quantized to be compatible with the audio features provided with YouTube-8M. They are stored as TensorFlow Record files.

Within the paper, the authors discuss using mel spectrograms on 960 ms chunks to get a 96x64 representation. It is then unclear to me how they get to the 1x128 format representation used in the Audioset. Does anyone know more about this??


Solution

  • They use the 96*64 data as input for a modified VGG network.The last layer of VGG is FC-128, so its output will be 1*128, and that is the reason.

    The architecture of VGG can be found here: https://github.com/tensorflow/models/blob/master/research/audioset/vggish_slim.py