
Why do I need a very high learning rate for this model to converge?


I have a simple model in TensorFlow which is being trained on the first 1000 images of the MNIST dataset. In my previous experience, learning rates on the order of 0.001 have worked, but for this model to converge the learning rate needs to be far higher, at least greater than 1. The model is shown below.

def gen_model():
    return tf.keras.models.Sequential([
        tf.keras.Input(shape=(28,28,)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='sigmoid'),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    
model = gen_model()
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=5), loss='mean_squared_error')
model.summary()
model.fit(x_train, y_train, batch_size=1000, epochs=10000)

Is it expected for models of this form to require an extremely high learning rate, or is there something I have missed? When I use a learning rate of around 0.001 the loss changes incredibly slowly.

The dataset was created with the following code:

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.astype("float32") / 255.0

x_train = x_train.reshape(60000, 28, 28)[:1000]
y_train = y_train[:1000]

y_train = tf.one_hot(y_train, 10)

Solution

  • Generally speaking, models that require learning rates larger than 1 raise a red flag for me. Your model looks like a vanilla multilayer perceptron, so there's nothing overly complicated about it, but a couple of things about your setup stand out:

    • The output from your model uses a softmax, which normally represents values drawn from a categorical distribution (i.e., 1-of-k) -- this is typical for a classification model. But the loss you're using, mean squared error, is typically used for optimizing Gaussian or regression outputs. You might want to try a cross-entropy loss instead to see if that helps; the first sketch after this list shows what that looks like.

    • The output from your model is in probability space, so the values you get out of your model are in [0, 1]. The loss you're using averages the squared differences between the model output and the target one-hot vector (whose values are in {0, 1}), so each loss value, and each gradient entry, is always smaller than 1. With a learning rate less than 1, and gradients that are further scaled by the (small) existing model weights during backpropagation, the delta applied to your weights on each step is always going to be tiny. Sometimes that's a good thing, but my guess is that in this case -- particularly at the start of training, when the model weights aren't near their optimal values -- training is going to be quite slow. The second sketch after this list puts rough numbers on this.

    • Related to the above point, you might try initializing your model weights with a larger range of values than the default. This would help make the gradient values larger, but could also make the model more likely to diverge; the third sketch after this list shows one way to do it.

    • You could also try replacing your softmax output activation with a plain linear activation, in effect converting your model's output to (unnormalized) log-probability space. Then you'd need to change your dataset labels to also represent target log-probability values, which isn't possible exactly, but you could get close with something like -1e8 * (1 - one_hot), where the large negative number stands in for log(0). If you wanted to go this route, though, you'd effectively be implementing a cross-entropy loss yourself; see the first point. A rough version appears in the final sketch below.
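
Here's a minimal sketch of the first point, reusing gen_model, x_train, and y_train from the question; the learning rate of 0.5 and the smaller batch size are arbitrary choices for illustration:

model = gen_model()
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.5),  # a conventional learning rate
    loss='categorical_crossentropy',  # matches softmax outputs and one-hot targets
    metrics=['accuracy'],
)
model.fit(x_train, y_train, batch_size=100, epochs=50)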
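
To put rough numbers on the second point: with k = 10 classes, the gradient of the mean squared error with respect to a probability output p is 2/k * (p - y), so no entry can exceed 2/k = 0.2 in magnitude, and it shrinks further when backpropagated through the softmax. A freshly initialized network outputs roughly uniform probabilities, which makes this concrete:

import numpy as np

k = 10
p = np.full(k, 1.0 / k)        # a fresh softmax model outputs ~uniform probabilities
y = np.eye(k)[3]               # one-hot target for class 3
grad_mse = 2.0 / k * (p - y)   # gradient of MSE with respect to the probabilities
print(np.abs(grad_mse).max())  # 0.18 -- small before the softmax Jacobian shrinks it further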
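
For the third point, one way to widen the initialization is to replace the default Glorot initializer with a random normal of larger standard deviation; the 0.5 here is an arbitrary value for illustration:

init = tf.keras.initializers.RandomNormal(stddev=0.5)
model = tf.keras.models.Sequential([
    tf.keras.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='sigmoid', kernel_initializer=init),
    tf.keras.layers.Dense(10, activation='softmax', kernel_initializer=init),
])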
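
Finally, a rough sketch of the last point, mainly to show why it collapses back into cross-entropy: the -1e8 is the stand-in for log(0) mentioned above, and regressing onto values that extreme with a squared-error loss is exactly what a proper cross-entropy loss lets you avoid:

model = tf.keras.models.Sequential([
    tf.keras.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='sigmoid'),
    tf.keras.layers.Dense(10),            # linear activation: unnormalized log-probabilities
])
log_targets = -1e8 * (1.0 - y_train)      # ~log(0) for wrong classes, 0 for the correct one
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.001), loss='mean_squared_error')
model.fit(x_train, log_targets, batch_size=100, epochs=10)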