keras, deep-learning, conv-neural-network, keras-layer

Creating a CNN Model in Keras with feature maps from each of the previous filtered images


I am trying to implement a convolutional neural network to perform two-class pixel-wise classification, as shown in the attached figure (from Chen et al., Nature 2017).

Can you give me a hint on what the third and fourth layers should look like?

This is how far I've got already:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D

model = Sequential()
model.add(Conv2D(40, (15, 15), activation='relu',
          padding='same', input_shape=(64, 64, 1)))  # first layer
model.add(MaxPooling2D((2, 2), padding='same'))  # second layer
# model.add(...)  # third layer  <-- how to implement this?
# model.add(...)  # fourth layer <-- how to implement this?
print(model.summary())

How many kernels did they use for the remaining layers, and how should I interpret the summation symbols in the figure?

Thanks in advance!


Solution

  • The actual question is rather ambiguous. Am I guessing correctly that you want someone to implement the missing two lines of code for the network?

    model = Sequential()
    model.add(Conv2D(40, (15, 15), activation='relu', 
              padding='same', input_shape=(64, 64, 1)))
    model.add(MaxPooling2D((2, 2), padding='same'))
    model.add(Conv2D(40, (15, 15), activation='relu', padding='same'))  # layer 3
    model.add(Conv2D(1, (15, 15), activation='linear', padding='same'))  # layer 4
    print(model.summary())
    

    To get 40 feature maps after layer 3, we just convolve with 40 different kernels. After layer 4, there should be only one feature map / channel, so 1 kernel is enough here.
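    You can trace the output shapes by hand: with `'same'` padding a convolution preserves the spatial size and only changes the channel count, while 2x2 max pooling halves each spatial dimension. A minimal sketch in plain Python (no Keras needed, just mirroring the shape rules above):

    ```python
    def conv2d_same(shape, n_kernels):
        """'same'-padded convolution keeps the spatial size; channels become n_kernels."""
        h, w, _ = shape
        return (h, w, n_kernels)

    def maxpool2d(shape, pool=2):
        """'same'-padded 2x2 max pooling halves each spatial dimension (rounding up)."""
        h, w, c = shape
        return (-(-h // pool), -(-w // pool), c)

    shape = (64, 64, 1)             # input
    shape = conv2d_same(shape, 40)  # layer 1 -> (64, 64, 40)
    shape = maxpool2d(shape)        # layer 2 -> (32, 32, 40)
    shape = conv2d_same(shape, 40)  # layer 3 -> (32, 32, 40)
    shape = conv2d_same(shape, 1)   # layer 4 -> (32, 32, 1)
    print(shape)  # (32, 32, 1)
    ```

    These are the same shapes `model.summary()` would report for the model above.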

    By the way, the figure seems to be from Convolutional neural networks for automated annotation of cellular cryo-electron tomograms (PDF) by Chen et al., a Nature article from 2017.

    Update:

    Comment: [...] why the authors say 1600 kernels in total and there is a summation?

    Actually, the authors seem to follow a rather unusual notation here, and their way of counting kernels is (imho) incorrect. What they count are the 2-D slices of each kernel: for layer 3, 40 output maps times 40 input maps gives 1600 slices of size 15x15, rather than 40 three-dimensional kernels.

    Perhaps they did not take into account that the kernels are in fact 3-D, with the last dimension equal to the number of input feature maps.

    When we break it down there are

    • 40 kernels of size 15x15x1 for the 1st layer (which makes 40 * 15 ** 2 trainable weights)
    • No kernels in the 2nd layer (max pooling has no trainable weights)
    • 40 kernels of size 15x15x40 in the 3rd layer (which makes 1600 * 15 ** 2 trainable weights)
    • 1 kernel of size 15x15x40 for the 4th layer (which makes 40 * 15 ** 2 trainable weights)
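
    The counts above can be checked with a few lines of plain Python. Per conv layer, the weight count is `out_maps * in_maps * k * k` (biases ignored; Keras adds one bias per output map on top):

    ```python
    k = 15  # kernel height/width

    layer1 = 40 * 1 * k**2    # 40 kernels of 15x15x1
    layer3 = 40 * 40 * k**2   # 40 kernels of 15x15x40 -> the "1600 kernels"
    layer4 = 1 * 40 * k**2    # 1 kernel of 15x15x40

    print(layer1, layer3, layer4)  # 9000 360000 9000
    assert layer3 == 1600 * k**2   # matches the authors' "1600 kernels" count
    ```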