
Processing a long audio signal with Conv1D in Keras


I have a long audio signal x which is a 1D list of 100000 samples.

For simplicity, let's say all I want to do is convolve it with a length 15 filter, and end up outputting a target filtered signal y of 100000 samples.

So basically, I'm trying to do y = conv(x, h) with a 1D CNN, and the filter h is to be trained.

What is the best way to do this in Keras? All the examples I find seem to be of the form "Each of the samples is a sequence of length 400 words, and convolution is run along that sequence of 400 words". From that, it seems my only choice is to break the audio signal into chunks of size sequence_length, but I'd really rather avoid this, since I basically only have 1 input sequence of length 100000.

Ideally, the code would look like

import numpy as np
from keras.models import Model
from keras.layers import Conv1D, Input

x_train = np.random.randn(100000)
y_train = 2*x_train
x_val = np.random.randn(10000)
y_val = 2*x_val

batch_size = 64

myinput = Input(shape=(None, 1)) # shape = (timesteps, channels); the batch dimension is implicit
output = Conv1D(
    1, # output dimension is 1
    15, # filter length is 15
    padding="same")(myinput)

model = Model(inputs=myinput, outputs=output)

model.compile(loss='mse',
              optimizer='rmsprop',
              metrics=['mse'])


model.fit(x_train, y_train,
          batch_size=batch_size, epochs=100, shuffle=False,
          validation_data=(x_val, y_val))

Of course, the big problem here is shaping things correctly.


Solution

  • You're totally on the right track.

    Although you have only one sound example*, Keras will still assume you have many. The solution is simply to add a dimension for that in your input.

    Also, Keras expects convolution data to have "channels". If you have only one channel (mono rather than stereo, for instance), add a dimension for it with value one.

    So, your input data should be shaped as:

    • (1, 100000, 1) - if using data_format='channels_last' (the default)
    • (1, 1, 100000) - if using data_format='channels_first'

    This means: 1 sample of a signal with length 100000 and one channel.
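    Putting that reshaping together with the question's sketch, a runnable version could look like this (the random training data and the filter length 15 are taken from the question; epochs is reduced here just to keep the example quick):

```python
import numpy as np
from keras.models import Model
from keras.layers import Conv1D, Input

# Shape the 1D signals as (n_examples, timesteps, channels)
# for data_format='channels_last' (the default).
x_train = np.random.randn(100000).reshape(1, 100000, 1)
y_train = 2 * x_train
x_val = np.random.randn(10000).reshape(1, 10000, 1)
y_val = 2 * x_val

myinput = Input(shape=(None, 1))  # (timesteps, channels); batch dim implicit
output = Conv1D(1, 15, padding="same")(myinput)

model = Model(inputs=myinput, outputs=output)
model.compile(loss='mse', optimizer='rmsprop', metrics=['mse'])

model.fit(x_train, y_train, epochs=1,
          validation_data=(x_val, y_val))
```

    With shape=(None, 1) the model accepts signals of any length, so the 10000-sample validation signal works with the same network.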

    All the rest in your model seems pretty fine for the task.


    If your memory cannot hold the entire data at once, then you'd need to divide your audio into chunks. Otherwise, you're good to go. (Notice that when dividing, you might get better results with padding='valid', because "same" would add border artifacts at each cut.)
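    If chunking does become necessary, a minimal sketch might look like this (the chunk length of 5000 and the helper name make_chunks are arbitrary choices for illustration):

```python
import numpy as np

def make_chunks(signal, chunk_len=5000):
    """Split a 1D signal into non-overlapping chunks of chunk_len samples,
    dropping any remainder. Returns an array of shape
    (n_chunks, chunk_len, 1), ready for a channels_last Conv1D."""
    n_chunks = len(signal) // chunk_len
    trimmed = signal[:n_chunks * chunk_len]
    return trimmed.reshape(n_chunks, chunk_len, 1)

x = np.random.randn(100000)
chunks = make_chunks(x)  # shape (20, 5000, 1): 20 examples for Keras
```

    Each chunk then becomes one Keras "sample", and with padding='valid' a length-15 filter shortens each chunk by 14 timesteps instead of padding zeros across the cut points.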

    You might be interested in reading about WaveNet and its accompanying paper.

    They use stacked convolutional layers with increasing dilation rates.
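    The idea can be sketched as a small stack of dilated causal convolutions; the filter counts and dilation rates below are illustrative only, and the real WaveNet additionally uses gated activations with residual and skip connections:

```python
import numpy as np
from keras.models import Model
from keras.layers import Conv1D, Input

inp = Input(shape=(None, 1))
h = inp
# Doubling the dilation rate each layer grows the receptive
# field exponentially with depth while keeping few parameters.
for rate in (1, 2, 4, 8):
    h = Conv1D(16, 2, dilation_rate=rate, padding='causal',
               activation='relu')(h)
out = Conv1D(1, 1)(h)  # 1x1 convolution back to a single channel

model = Model(inp, out)
```

    padding='causal' keeps the output the same length as the input while ensuring each output timestep only sees past inputs.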


    * - In Keras, each "example" is called a "sample", although in audio processing "samples" usually refers to the timesteps. So a complete audio file would be one "sample" in Keras.