Tags: python, machine-learning, neural-network, mnist

neural network not optimizing weights of first layer, returning all 1's for z1


I'm building a neural network in Python, loosely following Andrew Ng's machine learning course. It has 3 layers (all sigmoid) and is meant to predict the MNIST dataset. But it fails to actually predict anything: while the cost decreases with every iteration, accuracy stays at around 0.1, which indicates something isn't working properly. After playing around and printing the intermediate steps, I noticed that z1 (computed as sigmoid(train_X @ np.transpose(theta1)) in the first layer) is almost uniformly 1's, which seems to keep the network from functioning. This happens because train_X @ np.transpose(theta1) produces large values, so e^(-z) is essentially 0 and the sigmoid returns essentially 1. That in turn makes the gradient for theta1 essentially all 0's (since it is multiplied by z1 * (1 - z1)), so the network can't learn anything. Things I have tried (see the small saturation demo after this list):

Regularizing the dataset

Playing around with alpha values (which doesn't do anything, as expected)

Making the theta1 values start out really small (all drawn from a uniform distribution, then multiplied by something like 0.000001): this fixes the problem for the first iteration, but the network then optimizes theta1 right back to where z1 is uniformly 1's again, so I suspect forward/backward propagation isn't working properly? Even though I basically copied it from one of my working solutions to a course exercise in Octave, which produced reasonable results and ended up with high accuracy.
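
To show what I mean by the saturation, here is a small standalone sketch (not my actual network; the X and theta names are just placeholders) of how unscaled 0..255 pixel values combined with uniform(0, 1) weights push the sigmoid to 1 and its gradient factor to 0:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(5, 784)).astype(float)  # raw 0..255 "pixel" rows
theta = rng.random((4, 785))                            # uniform(0, 1) weights incl. bias column

X_bias = np.c_[np.ones((X.shape[0], 1)), X]             # add the bias unit, as in my code
pre = X_bias @ theta.T                                  # pre-activations are in the thousands
z = sigmoid(pre)                                        # ...so the sigmoid saturates
print(pre.min())                                        # large positive values -> e^(-z) is ~0
print(z.min())                                          # essentially 1.0 everywhere
print((z * (1 - z)).max())                              # gradient factor ~0 -> theta1 never updates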

Here is my algorithm:

def nn_forwardPropagation(train_X, train_Y, theta1, theta2, theta3):
    accuracy = 0
    J = 0
    #forward propagation, always adding 1's for the bias unit
    train_X = np.c_[ np.ones((np.shape(train_X)[0],1)), train_X]
    z1 = sigmoid(train_X@np.transpose(theta1))
    a1 = np.c_[ np.ones((np.shape(z1)[0],1)), z1]
    z2 = sigmoid(a1@np.transpose(theta2))
    a2 = np.c_[ np.ones((np.shape(z2)[0],1)), z2]
    #last step
    predValues = np.zeros((10,len(train_Y)))
    y1 = predValues.copy()
    predValues2 = y1.copy()
    for j in range(len(train_Y)):
        for i in range(np.shape(theta3)[0]):
            predValues[i,j] = sigmoid(a2[j,:]@np.transpose(theta3[i,:]))
        #making y1 our "target matrix" where we would like to see a 1 at the place of the right number and 0's everywhere else 
        y1[train_Y[j],j] = 1
        J += np.sum((np.transpose(-y1[:,j])@np.log(predValues[:,j]))  -  (1-np.transpose(y1[:,j]))@np.log(1 - predValues[:,j]))/len(train_Y)
        #in order to calculate accuracy, we just assume that the highest value in our predicted values dictates which value is "right" and looks if that's right
        predValues2[np.where(predValues[:,j] == max(predValues[:,j])),j] = 1 
        if np.array_equiv(predValues2[:,j], y1[:,j]):
            accuracy += 1/len(train_Y)
    #calculating the derivatives
    delta3 = predValues - y1  
    delta2 = np.transpose(delta3) @ theta3[:,1:] * z2 * (1-z2)
    delta1 = delta2 @ theta2[:,1:] * z1 * (1-z1)
    D3 = delta3 @ a2 
    D2 = np.transpose(delta2) @ a1
    D1 = np.transpose(delta1) @ train_X
    theta3_grad = D3 / len(train_Y)
    theta2_grad = D2 / len(train_Y)
    theta1_grad = D1 / len(train_Y)
    return J, theta3_grad, theta2_grad, theta1_grad, accuracy
    
def nn_runner(train_X, train_Y, iterations, alpha1, alpha2, alpha3):
    theta1 = np.random.rand(4,785)
    theta2 = np.random.rand(7,5)
    theta3 = np.random.rand(10,8)
    for i in range(1,iterations+1):
        J, theta3_grad, theta2_grad, theta1_grad, accuracy = nn_forwardPropagation(train_X, train_Y, theta1, theta2, theta3)
        theta3 -= alpha3*theta3_grad
        theta2 -= alpha2*theta2_grad
        theta1 -= alpha1*theta1_grad
        if i % 1 == 0:
            print("At iteration",i,":", J)
    print("This amounts to a total accuracy on training data of", accuracy)
    return theta3, theta2, theta1

Does anybody here know what I did wrong, or what I could look into to make this work? Thanks


Solution

  • OK, so I did some more testing and figured out that it really was a combination of things: wrongly initialized theta values, a learning rate that was too high, and not normalizing the MNIST dataset (which kept theta1 from learning, since a lot of pixels in MNIST are black, so many inputs are 0 and their gradients are 0 as well). After fixing that (initializing the theta values from a random distribution, then multiplying them by 2 * sqrt(2) / sqrt(# of neurons in previous layer) and subtracting that again), the model was able to predict the dataset with an accuracy of 95.6% on training and 91.1% on testing data, although unfortunately only after a couple hundred epochs. I'll look into ways to make this converge faster in the future (stochastic gradient descent, a different optimization algorithm, or playing around with the learning rate a bit). Hope this might help someone in the future.
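
For reference, here is a minimal sketch of the two changes that mattered; the exact scaling is my reading of the description above (a symmetric uniform initialization with half-width sqrt(2)/sqrt(# of neurons in the previous layer)), and init_layer / train_X_raw are just placeholder names, while the layer sizes match the question's code:

import numpy as np

def init_layer(n_out, n_in_with_bias):
    # Multiply uniform(0, 1) values by 2*s, then subtract s, so the weights lie in (-s, +s)
    # instead of all being positive.
    s = np.sqrt(2) / np.sqrt(n_in_with_bias - 1)   # neurons in the previous layer, excluding bias
    return np.random.rand(n_out, n_in_with_bias) * 2 * s - s

# Normalize the raw 0..255 MNIST pixels to 0..1 so the pre-activations stay small:
# train_X = train_X_raw / 255.0

theta1 = init_layer(4, 785)
theta2 = init_layer(7, 5)
theta3 = init_layer(10, 8)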