tensorflow artificial-intelligence feature-extraction

What are vggish_model.ckpt and vggish_pca_params.npz?

I am trying to understand some aspects of audio classification and came across "vggish_model.ckpt" and "vggish_pca_params.npz". I would like to understand what these two files are. Are they part of TensorFlow or of Google's AudioSet? Why do I need them when building audio features? I couldn't find any documentation about them.

Solution

The precalculated features released with AudioSet are "embeddings" from a deep net that was trained to predict video-level tags from soundtracks (see https://arxiv.org/abs/1609.09430). The output of the embedding layer is further processed via PCA to reduce dimensionality; this processing is included to make the features compatible with the ones released in https://research.google.com/youtube8m/ . So, vggish_model.ckpt holds the weights of the VGG-like deep CNN used to calculate the embedding from mel-spectrogram patches, and vggish_pca_params.npz holds the bases for the PCA transformation.
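To make the PCA step concrete, here is a minimal NumPy sketch of that kind of postprocessing. It uses synthetic stand-ins rather than the real files. The function name `pca_postprocess`, the clipping range of ±2 and the 8-bit quantization are assumptions for illustration, not something confirmed above. As far as I recall, the arrays in vggish_pca_params.npz are stored under the keys `pca_eigen_vectors` and `pca_means`, but check with `np.load("vggish_pca_params.npz").files` rather than relying on that.

```python
import numpy as np

def pca_postprocess(embeddings, pca_matrix, pca_means, clip=2.0):
    """Project raw embeddings with a stored PCA basis, then quantize to 8 bits.

    embeddings: (n, d) float array of raw network outputs.
    pca_matrix: (d, d) array whose rows are the PCA basis vectors.
    pca_means:  (d, 1) array of per-dimension means to subtract.
    clip:       assumed clipping range (+/- clip) applied before quantization.
    """
    # Center each embedding, then project it onto the PCA basis.
    projected = np.dot(pca_matrix, embeddings.T - pca_means).T
    # Clip to a fixed range and map that range linearly onto 0..255.
    clipped = np.clip(projected, -clip, clip)
    return ((clipped + clip) * (255.0 / (2.0 * clip))).astype(np.uint8)

# Synthetic stand-ins for the network output and the PCA parameters.
rng = np.random.default_rng(0)
dim = 128
raw = rng.normal(size=(3, dim))                       # 3 fake embeddings
basis = np.linalg.qr(rng.normal(size=(dim, dim)))[0]  # random orthonormal basis
means = raw.mean(axis=0).reshape(-1, 1)

features = pca_postprocess(raw, basis, means)
print(features.shape, features.dtype)
```

With the real files you would replace `basis` and `means` with the arrays loaded from vggish_pca_params.npz, and `raw` with the output of the network restored from vggish_model.ckpt.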

The only audio-derived content released as part of AudioSet is these precalculated embedding features. If you train a model on these features and then want to use it to classify new inputs, you must convert each new input to the same domain, and for that you need both vggish_model.ckpt and vggish_pca_params.npz.
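Converting a new input starts with cutting its mel spectrogram into the fixed-size patches the network expects. Below is a rough sketch of that framing step only. The helper name `frame_into_patches` is hypothetical, and the patch size of 96 frames by 64 mel bands is my recollection of the released VGGish code, so treat both as assumptions. This sketch drops trailing frames that do not fill a whole patch.

```python
import numpy as np

def frame_into_patches(log_mel, frames_per_patch=96):
    """Split a (num_frames, num_bands) log-mel spectrogram into
    non-overlapping patches of shape (frames_per_patch, num_bands).

    Trailing frames that do not fill a whole patch are dropped.
    """
    num_frames, num_bands = log_mel.shape
    num_patches = num_frames // frames_per_patch
    # Keep only the frames that fit into complete patches.
    usable = log_mel[: num_patches * frames_per_patch]
    return usable.reshape(num_patches, frames_per_patch, num_bands)

# A fake spectrogram: 250 frames, 64 mel bands (assumed sizes).
log_mel = np.zeros((250, 64))
patches = frame_into_patches(log_mel)
print(patches.shape)  # (2, 96, 64)
```

Each patch would then be fed through the network restored from vggish_model.ckpt, and the resulting embeddings passed through the PCA transformation, so that new inputs end up in the same feature space as the released AudioSet features.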

If AudioSet had included waveforms, none of this would be needed. But YouTube's terms of service do not allow downloading and redistributing its users' content.