Tags: tensorflow, keras, neural-network, tinyml

How to classify sound using FFT and a neural network? Should I use a CNN or an RNN?


I am doing a personal project for educational purposes to learn Keras and machine learning. To start, I would like to classify whether a sound is a clap or a stomp.

I am using a microcontroller that is sound-triggered and samples sound every 20 µs (50 kHz). The microcontroller sends this raw ADC data to a PC for processing in Python. I am currently taking 1000 points and computing the FFT with NumPy (using rfft and taking its absolute value).
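
For reference, here is a minimal sketch of that preprocessing step, assuming 1000 raw ADC samples at a 20 µs sample period (the function and constant names are just for illustration):

```python
import numpy as np

SAMPLE_PERIOD_US = 20   # 20 µs per sample -> 50 kHz sample rate
N_SAMPLES = 1000        # points captured per trigger

def fft_magnitude(adc_samples):
    """Return the one-sided FFT magnitude of a 1-D array of raw ADC samples."""
    samples = np.asarray(adc_samples, dtype=np.float64)
    samples -= samples.mean()                 # remove the DC offset before the FFT
    return np.abs(np.fft.rfft(samples))       # 501 magnitude bins for 1000 samples

# Frequencies corresponding to each bin (0 Hz .. 25 kHz)
freqs = np.fft.rfftfreq(N_SAMPLES, d=SAMPLE_PERIOD_US * 1e-6)
```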

Now I would like to feed the captured FFT spectra of claps and stomps as training data to a neural network that classifies them. I have been researching this all day; some articles say a Convolutional Neural Network should be used, while others say a Recurrent Neural Network should be used.

Looking at Convolutional Neural Networks raised another question: should I be using Keras' 1-D or 2-D convolution layers?


Solution

  • You need to process the FFT signals to classify whether the sound is a clap or a stomp.

    For Convolutional Neural Networks (CNNs):

    CNNs can extract features from fixed-length inputs. 1D CNNs with max-pooling work very well on signal data (I have personally used them on accelerometer data).

    You can use them if your input has a fixed length and contains significant features.
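
    A minimal 1D CNN sketch in Keras, assuming each training example is the 501-bin magnitude spectrum from rfft of 1000 samples and there are two classes (clap, stomp); the layer sizes are illustrative, not tuned:

```python
from tensorflow import keras
from tensorflow.keras import layers

N_BINS = 501      # assumed input length: rfft of 1000 samples -> 501 magnitude bins
N_CLASSES = 2     # clap vs. stomp

model = keras.Sequential([
    layers.Input(shape=(N_BINS, 1)),               # each FFT vector as a 1-D signal with one channel
    layers.Conv1D(16, kernel_size=5, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(32, kernel_size=5, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),
    layers.Dense(32, activation="relu"),
    layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train[..., None], y_train, epochs=20, validation_split=0.2)
```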

    For Recurrent Neural Networks (RNNs):

    These should be used when the signal has temporal features.

    Temporal features, for example, can be thought of this way for recognising a clap: a clap is an immediate, sharp loud sound followed by a soft sound as the clap dies away. An RNN will learn these two features as a sequence. Clapping is also a sequential action (it consists of several activities in sequence).

    RNNs and LSTMs can be the best choice if they are fed good features.
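
    A minimal LSTM sketch in Keras, assuming each capture is split into a sequence of short frames (the 20 frames of 50 values below are an assumption for illustration, not something taken from the question):

```python
from tensorflow import keras
from tensorflow.keras import layers

TIMESTEPS = 20    # assumed framing: split each capture into 20 frames
FEATURES = 50     # 50 values per frame (raw samples or per-frame spectral features)
N_CLASSES = 2

model = keras.Sequential([
    layers.Input(shape=(TIMESTEPS, FEATURES)),
    layers.LSTM(64),                               # learns how the frames evolve over time
    layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```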

    A hybrid CNN-LSTM:

    This network is a hybrid of CNNs and LSTMs (RNNs). It uses CNNs for feature extraction, and the resulting sequence of features is then learned by LSTMs. The features extracted by the CNNs also carry temporal information.

    This is quite easy to build if you are using Keras.
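
    A minimal CNN-LSTM sketch in Keras, using TimeDistributed Conv1D layers for per-frame feature extraction followed by an LSTM; the framing and layer sizes are again only assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

TIMESTEPS = 20    # assumed framing: 20 frames per capture
FRAME_LEN = 50    # 50 samples per frame, one channel
N_CLASSES = 2

model = keras.Sequential([
    layers.Input(shape=(TIMESTEPS, FRAME_LEN, 1)),
    # CNN applied independently to every frame to extract local features...
    layers.TimeDistributed(layers.Conv1D(16, kernel_size=3, activation="relu")),
    layers.TimeDistributed(layers.MaxPooling1D(pool_size=2)),
    layers.TimeDistributed(layers.Flatten()),
    # ...then an LSTM learns how those features evolve across the frames
    layers.LSTM(64),
    layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```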

    Tip:

    Since you are doing audio classification, I would also suggest using MFCCs to extract features.
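
    A minimal MFCC sketch, assuming the librosa library is available (it is not part of the original setup) and a 50 kHz sample rate derived from the 20 µs sample period; each column of the result is a frame, so the output is already a sequence suitable for an RNN:

```python
import numpy as np
import librosa   # assumed dependency, not mentioned in the question

SR = 50_000      # 20 µs sample period -> 50 kHz sample rate

def mfcc_features(adc_samples, n_mfcc=13):
    """Compute MFCCs from a 1-D array of raw ADC samples."""
    y = np.asarray(adc_samples, dtype=np.float32)
    y = (y - y.mean()) / (np.abs(y).max() + 1e-9)   # roughly normalise to [-1, 1]
    # Short frames because the capture is only 1000 samples long.
    # Result shape: (n_mfcc, n_frames)
    return librosa.feature.mfcc(y=y, sr=SR, n_mfcc=n_mfcc,
                                n_fft=256, hop_length=128)
```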

    I think you should try all three approaches and see which suits your data best. Most probably the RNN and the CNN-LSTM will work for your use case.

    Hope it helps.