I am currently working through the book Grokking Deep Learning by Andrew W. Trask, but I am having trouble understanding the code in Chapter 10, which builds a CNN using only Python and NumPy:
import numpy as np, sys
np.random.seed(1)

# Load MNIST and keep the first 1,000 training images, scaled to [0, 1]
from keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
images, labels = (x_train[0:1000].reshape(1000, 28*28) / 255, y_train[0:1000])

# One-hot encode the training labels
one_hot_labels = np.zeros((len(labels), 10))
for i, l in enumerate(labels):
    one_hot_labels[i][l] = 1
labels = one_hot_labels

# Prepare the test set the same way
test_images = x_test.reshape(len(x_test), 28*28) / 255
test_labels = np.zeros((len(y_test), 10))
for i, l in enumerate(y_test):
    test_labels[i][l] = 1

def tanh(x):
    return np.tanh(x)

def tanh2deriv(output):
    # derivative of tanh, expressed in terms of its output
    return 1 - (output ** 2)

def softmax(x):
    temp = np.exp(x)
    return temp / np.sum(temp, axis=1, keepdims=True)

# Hyperparameters
alpha, iterations = (2, 300)
pixels_per_image, num_labels = (784, 10)
batch_size = 128

input_rows = 28
input_cols = 28

kernel_rows = 3
kernel_cols = 3
num_kernels = 16

# 25 * 25 = 625 patch positions per image, each producing num_kernels values
hidden_size = ((input_rows - kernel_rows) * (input_cols - kernel_cols)) * num_kernels

kernels = 0.02 * np.random.random((kernel_rows * kernel_cols, num_kernels)) - 0.01
weights_1_2 = 0.2 * np.random.random((hidden_size, num_labels)) - 0.1

def get_image_section(layer, row_from, row_to, col_from, col_to):
    # slice the same patch out of every image in the batch
    section = layer[:, row_from:row_to, col_from:col_to]
    return section.reshape(-1, 1, row_to - row_from, col_to - col_from)

for j in range(iterations):
    correct_cnt = 0
    for i in range(int(len(images) / batch_size)):
        batch_start, batch_end = ((i * batch_size), ((i + 1) * batch_size))
        layer_0 = images[batch_start:batch_end]
        layer_0 = layer_0.reshape(layer_0.shape[0], 28, 28)

        # collect every 3x3 patch of every image in the batch
        sects = list()
        for row_start in range(layer_0.shape[1] - kernel_rows):
            for col_start in range(layer_0.shape[2] - kernel_cols):
                sect = get_image_section(layer_0,
                                         row_start, row_start + kernel_rows,
                                         col_start, col_start + kernel_cols)
                sects.append(sect)

        expanded_input = np.concatenate(sects, axis=1)
        es = expanded_input.shape
        flattened_input = expanded_input.reshape(es[0] * es[1], -1)

        # forward pass: convolution as a matrix product, tanh, dropout, softmax
        kernel_output = flattened_input.dot(kernels)
        layer_1 = tanh(kernel_output.reshape(es[0], -1))
        dropout_mask = np.random.randint(2, size=layer_1.shape)
        layer_1 *= dropout_mask * 2
        layer_2 = softmax(np.dot(layer_1, weights_1_2))

        # count correct predictions in this batch
        for k in range(batch_size):
            labelset = labels[batch_start + k:batch_start + k + 1]
            _inc = int(np.argmax(layer_2[k:k + 1]) == np.argmax(labelset))
            correct_cnt += _inc

        # backward pass and weight updates
        layer_2_delta = (labels[batch_start:batch_end] - layer_2) / (batch_size * layer_2.shape[0])
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * tanh2deriv(layer_1)
        layer_1_delta *= dropout_mask
        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        l1d_reshape = layer_1_delta.reshape(kernel_output.shape)
        k_update = flattened_input.T.dot(l1d_reshape)
        kernels -= alpha * k_update

    print("I:" + str(j), "Train-Acc:", correct_cnt / float(len(images)))
I know the code is quite dense and a bit hard to read, but it works properly, reaching a test accuracy of about 87.5% as expected. However, I want to understand why it works, and I have questions about a few parts of the code:
First, on the batching part.
layer_2_delta = (labels[batch_start:batch_end]-layer_2) / (batch_size*layer_2.shape[0])
I understand that because the batch size is 128, you process 128 images at a time, and (labels[batch_start:batch_end]-layer_2) is the delta of all 128 images together. So normally this delta should be divided by 128 to get the average delta for that layer. However, the code then divides the delta again by layer_2.shape[0], which also equals the batch size of 128. I don't understand why this is needed. If I remove the extra 128, the code fails: NumPy raises an "overflow" warning in the exp function (inside softmax), and the training accuracy just stays at 8.7%. Why is this extra "/128" essential to the code?
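Concretely, removing the extra 128 means I changed that line to the following (my edit, not the book's code):

layer_2_delta = (labels[batch_start:batch_end] - layer_2) / batch_size   # no extra /layer_2.shape[0]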
My other question is about gradient descent. Earlier in the book, I learned that gradient descent looks like this:
delta = target_output - real_output
weight += input*delta*alpha
But here I'm faced with this line:
kernels -= alpha*k_update
At first I thought this was a bug in the code. But after "correcting" the "-" to a "+", I got a similar result, with a test accuracy of 86%. How is this possible? There must be something fundamental about gradient descent that I haven't fully understood. What is the difference between the minus and the plus sign in gradient descent, and when should each be used?
(1) The extra division is just a constant scaling factor; it effectively becomes part of the learning rate. If you remove it, you get the classic symptom of a learning rate that is too high: the updates blow up, the logits grow until exp overflows in softmax, and the model never learns.
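To make (1) concrete, here is a minimal sketch (with made-up stand-in arrays, not the book's exact code) showing that the extra division by the batch size is algebraically the same as shrinking the learning rate by that factor:

import numpy as np

rng = np.random.default_rng(0)
batch_size, hidden_size, num_labels = 128, 100, 10
alpha = 2

layer_1 = rng.standard_normal((batch_size, hidden_size))    # stand-in hidden activations
raw_delta = rng.standard_normal((batch_size, num_labels))   # stand-in for labels - layer_2

# The book's update: the delta is divided by batch_size twice, with alpha = 2
update_book = alpha * layer_1.T.dot(raw_delta / (batch_size * batch_size))

# The same update with one of the divisions folded into the learning rate
update_folded = (alpha / batch_size) * layer_1.T.dot(raw_delta / batch_size)

print(np.allclose(update_book, update_folded))   # True

So with the extra division, the effective step taken on the batch-averaged gradient is only about 2/128 ≈ 0.016; dropping it makes every step 128 times larger, which is why the exp inside softmax overflows.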
(2) If you change all weight updates from:

for weight in weights:
    weight -= learning_rate * dloss_dweight

to:

for weight in weights:
    weight += learning_rate * dloss_dweight

...then you are no longer searching for the minimum of the loss; you are searching for its maximum, i.e., the worst possible model (see the toy sketch at the end of this answer).
However, in your case, you only changed it for one layer's weights, so what probably happens is that your other parameters compensate for it.
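Here is the toy sketch mentioned above (a 1-D example of my own, not from the book): minimize f(w) = (w - 3)**2, whose gradient is 2*(w - 3). With "-=" the weight walks toward the minimum at w = 3; with "+=" it walks away from it and the loss grows without bound:

learning_rate = 0.1
w_descent, w_ascent = 0.0, 0.0
for _ in range(50):
    w_descent -= learning_rate * 2 * (w_descent - 3)   # gradient descent
    w_ascent  += learning_rate * 2 * (w_ascent  - 3)   # flipped sign: ascent
print(w_descent)   # ~3.0, i.e. at the minimum of the loss
print(w_ascent)    # a huge negative number; the loss keeps growing

One thing to keep in mind: which sign means "descent" depends on what your delta already contains. In the snippet above, 2*(w - 3) is the gradient of the loss, so descent uses "-=". In the earlier pattern you quoted from the book, delta = target_output - real_output already carries the opposite sign (it is the negative of the loss gradient), which is why that update uses "+=".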