Tags: python, numpy, keras, linear-regression

Keras fails to find solution to linear convex problem


I wrote this reproducible code to demonstrate the problem:

import numpy as np
import keras
import tensorflow as tf

n, d = 2, 3
A = np.random.random((n, d))
b = np.random.random((n, 1))
x = np.linalg.lstsq(A, b, rcond=None)[0]
print("Numpy MSE is {}".format((np.linalg.norm(A @ x - b) ** 2) / n))

model = keras.models.Sequential()
model.add(keras.layers.Dense(1, use_bias=False, activation='linear'))
opt = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0, nesterov=False)
model.compile(loss="mse", optimizer=opt)
model.fit(A, b, batch_size=A.shape[0], epochs=10000, verbose=0)
x = model.layers[0].get_weights()[0]
print("Keras MSE is {}".format((np.linalg.norm(A @ x - b) ** 2) / n))

Basically, I am solving an under-determined system of linear equations Ax = b in two ways: once with numpy's least-squares solver, and once with standard gradient descent in keras.
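Since n < d, the system is under-determined, so (for a generic random A with full row rank) an exact solution with zero residual is guaranteed to exist, and lstsq returns the minimum-norm one. As a sanity check, here is a minimal NumPy sketch (the fixed seed is my own choice; dimensions match the high-dimensional case from the post) that computes the same minimum-norm solution via the Moore-Penrose pseudo-inverse:

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is an arbitrary choice for reproducibility
n, d = 200, 300                 # under-determined: more unknowns than equations
A = rng.random((n, d))
b = rng.random((n, 1))

# Minimum-norm exact solution via the Moore-Penrose pseudo-inverse;
# equivalent to np.linalg.lstsq(A, b, rcond=None)[0] for this setup.
x = np.linalg.pinv(A) @ b
mse = (np.linalg.norm(A @ x - b) ** 2) / n
print(mse)  # effectively zero, limited only by floating-point round-off
```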

When I run it, I get this output:

Numpy MSE is 6.162975822039155e-33
Keras MSE is 1.3108133821545518e-10

numpy yields a much better result, but I'm still willing to accept keras as a solution; 10^(-10) is fairly small.

Now increase n to 200 and d to 300. The output is now:

Numpy MSE is 1.4348640308871558e-30
Keras MSE is 0.0001953624326696054

Now not only is numpy much better, but as far as I am concerned, keras did not find a solution at all: the result is not close enough to zero, and I am stuck. Changing the learning rate or adding iterations does not change the result significantly. Why does this happen?

I know there's a solution. I want the error to be at most 10^(-10), using keras, for high-dimensional data such as the n = 200, d = 300 case.

TL;DR: I'm desperately trying to overfit. I know there's a solution that gives me 0 loss. My problem is linear and convex, a classic under-determined system, yet keras won't find that solution and give me 0 training loss.


Solution

  • You are missing the input_shape argument in the layer definition. I'm not quite sure why it does not work without a defined input_shape (the shape of the learned weights seems OK); but, according to the documentation:

    In general, it's a recommended best practice to always specify the input shape of a Sequential model in advance if you know what it is.

    The other thing is that, by setting batch_size=A.shape[0], you are actually using full-batch gradient descent, not stochastic gradient descent; in order to use SGD, you need to set a batch_size smaller than the size of your data sample.

    So, with the following changes to your code in the high-dimensional case (plus replacing all keras uses with tf.keras, since mixing the two is not good practice):

    # n, d = 200, 300
    
    model.add(tf.keras.layers.Dense(1, input_shape=(A.shape[1],), use_bias=False, activation='linear'))
    
    model.fit(A, b, batch_size=32, epochs=10000, verbose=0)
    

    after 10,000 epochs, the result is:

    Keras MSE is 1.9258555439788135e-10
    

    while iterating for 10,000 more epochs (i.e. total 20,000), we get:

    Keras MSE is 1.2521153241468356e-13
    

    Repeating the runs, we get qualitatively similar (but of course not identical) results.
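To see the effect of the suggested mini-batch setting outside Keras, here is a plain-NumPy sketch of mini-batch SGD on the same MSE objective. The seed and the zero initialization are my own choices; the learning rate, batch size, and epoch count mirror the values used above, and the gradient is that of the mean squared error over each mini-batch:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
n, d = 200, 300
A = rng.random((n, d))
b = rng.random((n, 1))

x = np.zeros((d, 1))  # plays the role of the single Dense(1) weight vector
lr, batch_size, epochs = 0.01, 32, 10000

for epoch in range(epochs):
    perm = rng.permutation(n)  # reshuffle the sample order each epoch
    for start in range(0, n, batch_size):
        idx = perm[start:start + batch_size]
        Ab, bb = A[idx], b[idx]
        # Gradient of the mini-batch MSE: (2/m) * Ab^T (Ab x - bb)
        grad = (2.0 / len(idx)) * Ab.T @ (Ab @ x - bb)
        x -= lr * grad

mse = (np.linalg.norm(A @ x - b) ** 2) / n
print("NumPy SGD MSE is {}".format(mse))
```

Because the system is consistent (an exact solution exists), every mini-batch gradient vanishes at that solution, so SGD has no noise floor here and the loss keeps shrinking toward zero, just as in the Keras runs above.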