Tags: tensorflow, optimization, deep-learning, gradient-descent

Why doesn't tf.train.GradientDescentOptimizer work on my digit recognition model, while ShampooOptimizer from tensorflow.contrib works just fine?


I developed a neural network model for digit recognition using tensorflow. I used tf.train.GradientDescentOptimizer as my optimizer and got very low prediction accuracy (around 11%). But if I only changed the optimizer to ShampooOptimizer from tensorflow.contrib, the model reached good accuracy on the validation data (around 92%).

I literally just changed one line of my code, from opt = tf.train.GradientDescentOptimizer(0.001) to opt = ShampooOptimizer(), and it worked.

I tried stopping in the middle of training and found a difference. For GradientDescentOptimizer: after 60 iterations, the best W had the same number across all dimensions (I set 87 dimensions for the first layer), and so did the best b. For Shampoo: after 60 iterations, the best W had different numbers across dimensions, and so did the best b. I noticed this difference, but I don't know why.
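Here is roughly how I inspected the weights (a minimal sketch; it assumes it is run inside the tf.Session() block from the full code below, so sess, W and b are the session and variables defined there):

w_val, b_val = sess.run((W, b))
# With GradientDescentOptimizer, every column of W was identical and so was
# every entry of b; with Shampoo they differed:
print(np.allclose(w_val, w_val[:, :1]))   # True -> all 87 columns equal
print(np.allclose(b_val, b_val[0, 0]))    # True -> all bias entries equal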

import tensorflow as tf
import numpy as np
from mnist import MNIST
from tensorflow.contrib.opt.python.training.shampoo import *

mndata = MNIST()
data, labels = mndata.load_training()
data = np.array(data)
nb_classes = 10
labels = np.eye(nb_classes)[labels]

test_data, test_labels = mndata.load_testing()
test_data = np.array(test_data)
nb_classes = 10
test_labels = np.eye(nb_classes)[test_labels]

X = tf.placeholder(dtype='float32',shape = (None,784))          
y = tf.placeholder(dtype='float32',shape = (None, 10))

W = tf.Variable(initial_value=np.ones((784,87)),dtype='float32',name='W',trainable=True) 
b = tf.Variable(initial_value=np.ones((1,87)),dtype='float32',name='b', trainable=True)
preds_t1 = tf.matmul(X,W) + b
preds_a1 = tf.nn.relu(preds_t1)                          

W2 = tf.Variable(initial_value=np.ones((87,10)),dtype='float32',name = 'W2')    
b2 = tf.Variable(initial_value=np.ones((1,10)),dtype='float32', name = 'b2')
logits = tf.matmul(preds_a1,W2) + b2
preds = tf.nn.softmax(logits, axis=1)

loss = tf.losses.softmax_cross_entropy(onehot_labels=y, logits=logits)
opt = tf.train.GradientDescentOptimizer(0.001)
opt_op = opt.minimize(loss = loss, var_list = [W, b, W2, b2])

s_preds = tf.argmax(preds, axis = 1)
s_labels = tf.argmax(y, axis = 1)
sacc, sacc_op = tf.metrics.accuracy(s_labels, s_preds)

saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.initializers.global_variables())
    sess.run(tf.local_variables_initializer())

    best_W, best_b, best_W2, best_b2 = sess.run((W, b, W2, b2))
    stop_count = 0
    patience = 40
    best_loss = np.inf
    train_data, train_labels, valid_data, valid_labels = train_valid_split(data, labels, split=0.2)
    for i in range(300):
        batch_X, batch_y = random_sampling(train_data, train_labels, 12000)
        sess.run((opt_op), feed_dict={X: batch_X, y: batch_y})
        s_loss, s_accuracy = sess.run((loss, sacc_op), feed_dict={X: valid_data, y: valid_labels})       # validation
        print('epoch: ' + str(i) + '; loss is: ' + str(s_loss) + '; slack_accuracy is: ' + str(s_accuracy))
        # early stopping thing
        if s_loss < best_loss:
            best_loss = s_loss
            stop_count = 0      # reset the patience counter on improvement
            best_W, best_b, best_W2, best_b2 = sess.run((W, b, W2, b2))
        else:
            stop_count += 1
            if (stop_count >=  patience):
                print('Stopped at iteration: ' + str(i))
                break

Can anyone explain the difference between these two optimizers that leads to this difference in accuracy?


Solution

  • You are initializing all weights to the same value (using np.ones). This breaks your model, because all hidden units compute the same thing (and receive the same errors), so they all learn the same thing, meaning you effectively have only one hidden unit. I don't know exactly what the Shampoo optimizer does, but I suppose it performs some kind of symmetry breaking.
    Your model should work with plain gradient descent if you replace the constant initial weights with random numbers (e.g. tf.random_uniform([784,87], minval=-0.1, maxval=0.1) for the hidden layer), as sketched below. This prevents all units from starting out identical.
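As a minimal sketch of the fix (only the initial values change; the layer sizes and the rest of the model stay exactly as in your code, and the uniform range is just one reasonable choice):

# Random initial weights break the symmetry between hidden units.
# Constant (zero) biases are fine because the weights already differ.
W = tf.Variable(tf.random_uniform([784, 87], minval=-0.1, maxval=0.1),
                name='W', trainable=True)
b = tf.Variable(tf.zeros((1, 87)), name='b', trainable=True)
W2 = tf.Variable(tf.random_uniform([87, 10], minval=-0.1, maxval=0.1),
                 name='W2')
b2 = tf.Variable(tf.zeros((1, 10)), name='b2')

With this change, plain GradientDescentOptimizer can learn a distinct feature per hidden unit instead of 87 copies of the same one.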