Update: I summarized the question and its answers here.
Could you please point me to some other approaches, or try to convince me either that my current approach is still a good idea or that neural networks may be a feasible way?
Update: I already have two good answers, but another one would be welcome, and even rewarded.
A step up from convolution is dynamic time warping, which can be thought of as a convolution operator that stretches and shrinks one signal to optimally match another.
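To make the stretching and shrinking idea concrete, here is a minimal sketch of the textbook dynamic time warping distance in plain Python. The function name `dtw_distance` and the toy sequences are made up for illustration, and the absolute difference is just one possible local cost. A real recording would more likely be compared on extracted features than on raw samples.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences.

    Uses |a[i] - b[j]| as the local cost (an assumption; other costs work too)
    and runs in O(len(a) * len(b)) time and memory.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # extend the best of: advance in a only, advance in b only, advance in both
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical toy data: a short template and the same shape played at half speed.
template = [0, 1, 2, 1, 0]
stretched = [0, 0, 1, 1, 2, 2, 1, 1, 0, 0]
flat = [2, 2, 2, 2, 2]

print(dtw_distance(template, stretched))  # the stretched copy aligns perfectly
print(dtw_distance(template, flat))       # a different shape costs more
```

The stretched copy matches the template at zero cost even though the two have different lengths, which is exactly what a fixed-lag comparison cannot do.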
Perhaps a simpler approach would be to do an FFT of the sample and determine whether your insect has any particular frequencies that can be filtered on.
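As a rough sketch of that idea, the snippet below runs a small radix-2 FFT over a synthetic signal and reports the strongest frequency. Everything about the signal is an assumption for illustration: the 8192 Hz sample rate, the 400 Hz tone standing in for the insect, and the helper names `fft` and `dominant_frequency`. It is written in pure Python to stay self-contained; in practice a library routine such as `numpy.fft.rfft` would be the usual choice.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def dominant_frequency(samples, sample_rate):
    """Return the frequency (Hz) of the strongest bin below Nyquist, ignoring DC."""
    spectrum = fft(samples)
    half = len(samples) // 2
    peak_bin = max(range(1, half), key=lambda k: abs(spectrum[k]))
    return peak_bin * sample_rate / len(samples)


# Synthetic stand-in for a recording: a strong 400 Hz tone plus a weaker 1200 Hz one.
sample_rate = 8192  # Hz, assumed
n = 1024            # power of two, giving 8 Hz per bin
signal = [
    math.sin(2 * math.pi * 400 * t / sample_rate)
    + 0.3 * math.sin(2 * math.pi * 1200 * t / sample_rate)
    for t in range(n)
]

print(dominant_frequency(signal, sample_rate))
```

If the real recordings show a stable peak like this, a band-pass filter around it may be enough to detect the insect without any learning step.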
On the more complex side, but not quite a neural network, are SVM toolkits like libsvm and svmlight that you can throw your data at.
Regardless of the path you attempt, I would spend time exploring the nature of the sound your insect makes using tools like the FFT. After all, it will be easier to teach a computer to classify the sound if you can do it yourself.