python keras conv-neural-network max-pooling functional-api

Difference between the input shape for a 1D CNN, 2D CNN and 3D CNN

I'm first time building a CNN model for image classification and i'm a little bit confused about what would be the input shape for each type (1D CNN, 2D CNN, 3D CNN) and how to fix the number of filters in the convolution layer. My data is 100x100x30 where 30 are features. Here is my essay for the 1D CNN using the Functional API Keras:

def create_CNN1D_model(pool_type='max',conv_activation='relu'):
    input_layer = (30,1)
    conv_layer1 = Conv1D(filters=16, kernel_size=3, activation=conv_activation)(input_layer)
    max_pooling_layer1 = MaxPooling1D(pool_size=2)(conv_layer1)

    conv_layer2 = Conv1D(filters=32, kernel_size=3, activation=conv_activation)(max_pooling_layer1)
    max_pooling_layer2 = MaxPooling1D(pool_size=2)(conv_layer2)

    flatten_layer = Flatten()(max_pooling_layer2)
    dense_layer = Dense(units=64, activation='relu')(flatten_layer)

    output_layer = Dense(units=10, activation='softmax')(dense_layer)
    CNN_model = Model(inputs=input_layer, outputs=output_layer)
    return CNN_model
CNN1D = create_CNN1D_model()
CNN1D.compile(loss = 'categorical_crossentropy', optimizer = "adam",metrics = ['accuracy'])
Trace = CNN1D.fit(X, y, epochs=50, batch_size=100)

However, while trying the 2D CNN model by just changing Conv1D, Maxpooling1D to Conv2D and Maxpooling2D, i got the following error :

ValueError: Input 0 of layer conv2d_1 is incompatible with the layer: : expected min_ndim=4, found ndim=3. Full shape received: (None, 30, 1)

Can anyone please tell me how would be the input shape for 2D CNN and 3D CNN ? And what can be done on input data preprocessing?

Solution

TLDR; your X_train can be looked at as (batch, spatial dims..., channels). A kernel applies to the spatial dimensions for all channels in parallel. So a 2D CNN, would require two spatial dimensions (batch, dim 1, dim 2, channels).

So for (100,100,3) shaped images, you will need a 2D CNN that convolves over 100 height and 100 width, over all the 3 channels.

Lets, understand the above statement.

First, you need to understand what CNN (in general) is doing.

A kernel is convolving through the spatial dimensions of a tensor across its feature maps/channels while performing a simple matrix operation (like dot product) to the corresponding values.

Kernel moves over the spatial dimensions

Now, Let's say you have 100 images (called batches). Each image is 28 by 28 pixels and has 3 channels R, G, B (which are also called feature maps in context to CNNs). If I were to store this data as a tensor, the shape would be (100,28,28,3).

However, I could just have an image that doesn't have any height (may like a signal) OR, I could have data that has an extra spatial dimension such as a video (height, width, and time).

In general, here is how the input for a CNN-based neural network looks like.

Same kernel, all channels

The second key point you need to know is, A 2D kernel will convolve over 2 spatial dimensions BUT the same kernel will do this over all the feature maps/channels. So, if I have a (3,3) kernel. This same kernel will get applied over R, G, B channels (in parallel) and move over the Height and Width of the image.

Operation is a dot product

Finally, the operation (for a single feature map/channel and single convolution window) can be visualized like below.

Therefore, in short -

A kernel gets applied to the spatial dimensions of the data
A kernel shape is equal to the # of spatial dimensions
A kernel applies over all the feature maps/channels at once
The operation is a simple dot product between the kernel and window

Let's take the example of tensors with single feature maps/channels (so, for an image, it would be greyscaled) -

So, with that intuition, we see that if I want to use a 1D CNN, your data must have 1 spatial dimension, which means each sample needs to be 2D (spatial dimension and channels), which means the X_train must be a 3D tensor (batch, spatial dimensions, channels).

Similarly, for a 2D CNN, you would have 2 spatial dimensions (H, W for example) and would be 3D samples (H, W, Channels) and X_train would be (Samples, H, W, Channels)

Let's try this with code -

import tensorflow as tf
from tensorflow.keras import layers

X_2D = tf.random.normal((100,7,3))   #Samples, width/time, channels (feature maps)
X_3D = tf.random.normal((100,5,7,3)) #Samples, height, width, channels (feature maps)
X_4D = tf.random.normal((100,6,6,2,3))   #Samples, height, width, time, channels (feature maps)

For applying 1D CNN -

#With padding = same, the edge pixels are padded to not skip a few

#Out featuremaps = 10, kernel (3,)
cnn1d = layers.Conv1D(10, 3, padding='same')(X_2D) 
print(X_2D.shape,'->',cnn1d.shape)

#(100, 7, 3) -> (100, 7, 10)

For applying 2D CNN -

#Out featuremaps = 10, kernel (3,3)
cnn2d = layers.Conv2D(10, (3,3), padding='same')(X_3D) 
print(X_3D.shape,'->',cnn2d.shape)

#(100, 5, 7, 3) -> (100, 5, 7, 10)

For 3D CNN -

#Out featuremaps = 10, kernel (3,3)
cnn3d = layers.Conv3D(10, (3,3,2), padding='same')(X_4D) 
print(X_4D.shape,'->',cnn3d.shape)

#(100, 6, 6, 2, 3) -> (100, 6, 6, 2, 10)