Tags: tensorflow, keras, conv-neural-network, tf.keras

Terrible accuracy in Keras CNN


I am a beginner in deep learning, and I am trying to create a model for handwritten word classification. I created the dataset myself; it contains 71 different classes with 1,000 images per class.

The problem is that I have tried to build CNN models with different combinations of convolutional, max pooling, and dense layers, while also changing the optimizer, but the accuracy remains TERRIBLE. Here are my results.

Is this a problem with the model, with the dataset, or with my parameters? What do you suggest?

Here is the last model I tried:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.optimizers import Nadam

model = Sequential([
    Conv2D(32, kernel_size=(2, 2), activation="relu", input_shape=(143, 75, 1)),
    MaxPooling2D(pool_size=(3, 3)),
    Conv2D(64, kernel_size=(4, 4), activation="relu"),
    MaxPooling2D(pool_size=(9, 9)),
    Flatten(),
    Dense(512, activation="relu"),
    Dense(128, activation="sigmoid"),
    Dense(71, activation="softmax")
])

model.compile(optimizer=Nadam(learning_rate=0.01), loss="categorical_crossentropy", metrics=["accuracy"])

Solution

  • The problem with your model is your pool size. The official Keras documentation says this about the pooling layer:

    Downsamples the input along its spatial dimensions (height and width) by taking the maximum value over an input window (of size defined by pool_size) for each channel of the input. The window is shifted by strides along each dimension.

    By default, the pooling layer has a pool size of (2,2), which means that in each 2x2 window (four elements), only the maximum value is kept.
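
    As a minimal illustration of that (assuming TensorFlow and NumPy are available), here is a (2,2) max pool applied to a small 4x4 input:

    import numpy as np
    import tensorflow as tf

    # A 4x4 single-channel input, shaped (batch, height, width, channels)
    x = np.array([[ 1,  2,  5,  6],
                  [ 3,  4,  7,  8],
                  [ 9, 10, 13, 14],
                  [11, 12, 15, 16]], dtype=np.float32).reshape(1, 4, 4, 1)

    # Default pool_size=(2,2): each non-overlapping 2x2 window keeps only its maximum
    pooled = tf.keras.layers.MaxPooling2D()(x)
    print(pooled.numpy().reshape(2, 2))
    # [[ 4.  8.]
    #  [12. 16.]]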

    If we print the summary of your model, we get:

    Model: "sequential"
    _________________________________________________________________
     Layer (type)                Output Shape              Param #   
    =================================================================
     conv2d (Conv2D)             (None, 142, 74, 32)       160       
                                                                     
     max_pooling2d (MaxPooling2D  (None, 47, 24, 32)       0         
     )                                                               
                                                                     
     conv2d_1 (Conv2D)           (None, 44, 21, 64)        32832     
                                                                     
     max_pooling2d_1 (MaxPooling  (None, 4, 2, 64)         0         
     2D)                                                             
                                                                     
     flatten (Flatten)           (None, 512)               0         
                                                                     
     dense (Dense)               (None, 512)               262656    
                                                                     
     dense_1 (Dense)             (None, 128)               65664     
                                                                     
     dense_2 (Dense)             (None, 71)                9159      
                                                                     
    =================================================================
    Total params: 370,471
    Trainable params: 370,471
    Non-trainable params: 0
    _________________________________________________________________
    
    

    So, looking at those layers and their shapes, there is a big jump between the conv2d_1 (Conv2D) output and the max_pooling2d_1 (MaxPooling2D) output: the shape drops from (44, 21, 64) to (4, 2, 64). This is because you are using a pool size of (9,9) for the pooling layer before the Dense layers.
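
    You can check every shape in that summary by hand. Under the Keras defaults ('valid' padding, with pooling strides equal to pool_size), a convolution outputs input - kernel + 1 and a pooling layer outputs floor(input / pool) along each spatial dimension; a small sketch of that arithmetic:

    # Shape arithmetic under the Keras defaults ('valid' padding;
    # pooling strides default to pool_size)
    def conv_dim(size: int, kernel: int) -> int:
        return size - kernel + 1

    def pooled_dim(size: int, pool: int) -> int:
        return size // pool

    print(conv_dim(143, 2), conv_dim(75, 2))      # 142 74 -> conv2d
    print(pooled_dim(142, 3), pooled_dim(74, 3))  # 47 24  -> max_pooling2d
    print(conv_dim(47, 4), conv_dim(24, 4))       # 44 21  -> conv2d_1
    print(pooled_dim(44, 9), pooled_dim(21, 9))   # 4 2    -> max_pooling2d_1
    # Flatten: 4 * 2 * 64 = 512 features, matching the summary above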

    To understand the effect of the pool size and pooling, consider the input image below, which has a size of (183,183,3).

    [Image: original input image, (183, 183, 3)]

    Now, when we apply 2D max pooling to the above image with a pool size of (2,2), we get the following image, whose spatial dimensions are reduced to (91,91,3). The dimensions shrink, but the information within the image is preserved.

    [Image: max-pooled output, pool size (2,2), (91, 91, 3)]

    For the same input image, the max pooling output with a pool size of (3,3) would be the following image, with dimensions of (61,61,3).

    [Image: max-pooled output, pool size (3,3), (61, 61, 3)]

    Here you can barely see the word Awesome in the image.

    With a pool size of (5,5), we get the max pool output below, with spatial dimensions of (36,36,3). Here you don't see any information at all.

    [Image: max-pooled output, pool size (5,5), (36, 36, 3)]

    Why is that? Because white pixels are 255 and black pixels are 0, and a max pool always takes the 255. Since you are using a pool size of (9,9), each window covers far more white pixels than black ones, so you lose the useful information, with your spatial dimensions reducing to (20,20,3), just as in the (5,5) case. (Only the effect of pooling on the image is shown here; once you add the Conv2D layers, the output will also change based on the filter values.)
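
    As a minimal sketch of that effect (assuming NumPy), take a 9x9 patch of white background crossed by a thin black stroke; a (9,9) max pool keeps only the white:

    import numpy as np

    # A 9x9 patch: white background (255) crossed by a one-pixel black stroke (0)
    patch = np.full((9, 9), 255, dtype=np.uint8)
    patch[:, 4] = 0  # thin vertical stroke, like part of a handwritten letter

    # A (9,9) max pool collapses this entire window to a single value
    print(patch.max())  # 255 -- the stroke disappears; only the white survives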

    With this, your model will not be able to learn anything, no matter which other components you change, such as the optimizer or the loss function.

    So, change the model architecture to something like the one below and retrain your network:

    import tensorflow as tf
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

    model = Sequential([
        Conv2D(32, kernel_size=(3, 3), activation="relu", input_shape=(143, 75, 1)),
        MaxPooling2D(),  # defaults to pool_size=(2, 2)
        Conv2D(64, kernel_size=(3, 3), activation="relu"),
        MaxPooling2D(),
        Flatten(),
        Dense(512, activation="relu"),
        Dense(128, activation="relu"),
        Dense(71, activation="softmax")
    ])

    model.compile(optimizer=tf.keras.optimizers.Nadam(learning_rate=0.01), loss="categorical_crossentropy", metrics=["accuracy"])
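
    For completeness, a hedged usage sketch with random placeholder data (your real x_train must be normalized and y_train one-hot encoded, since the model is compiled with categorical_crossentropy):

    # Random placeholders standing in for your real dataset (for illustration only)
    import numpy as np
    x_train = np.random.rand(64, 143, 75, 1).astype("float32")  # pixels already in [0, 1]
    y_train = tf.keras.utils.to_categorical(np.random.randint(0, 71, size=64), num_classes=71)

    model.fit(x_train, y_train, epochs=10, batch_size=32)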
    

    P.S.: It is common to choose a kernel size of (3,3) or (5,5) for a convolution layer. For example, a layer defined as Conv2D(64, kernel_size=(3, 3)) has 64 filters, each of size (3,3). Also, don't forget to normalize your images before you feed them to the model.
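
    A minimal sketch of that normalization step (x_train here is a random placeholder standing in for your real image array):

    import numpy as np

    # Placeholder: grayscale images loaded as uint8, shape (num_images, 143, 75, 1)
    x_train = np.random.randint(0, 256, size=(100, 143, 75, 1), dtype=np.uint8)

    # Scale pixel values from [0, 255] down to [0, 1] before feeding the model
    x_train = x_train.astype("float32") / 255.0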

    Cheers!!!