Linear regression parameter estimation using Keras

My dataset has grown so large that I can no longer use typical OLS methods to compute my linear regression estimates, so I want to use a standard optimizer instead (Adam seems to be a good fit).

I understand that I can do this fairly simply with Keras; see the example below:

    import numpy as np
    import tensorflow as tf
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.optimizers import Adam

    # Define the model
    def build_model(input_dim):
        model = Sequential()
        # Using a smaller standard deviation for the normal initializer
        model.add(Dense(1, input_dim=input_dim, kernel_initializer=tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.05), activation='linear'))
        # Increased learning rate
        optimizer = Adam(learning_rate=0.1)
        model.compile(loss='mse', optimizer=optimizer, metrics=['mse'])
        return model

    # Example usage:
    X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]], dtype=float)
    y = np.array([3, 5, 7, 9, 11], dtype=float)

    # Build and train the model
    model = build_model(input_dim=2)
    model.fit(X, y, epochs=1000, verbose=0, batch_size=5)  # Reduced number of epochs and batch size

    # Make predictions
    predictions = model.predict(X)
    print("Predictions:", predictions.flatten())

    # Output the model summary to check the structure
    model.summary()

However, my problem is that even after 1000 epochs it still doesn't converge to the obvious weights of (1, 1); it ends up at around 1.15 / 0.85.

Is Adam not a good optimizer for this example, or am I doing something wrong? I remember playing around with SGD some time ago, which I recall converging extremely quickly on linear regression problems. It's a bit concerning for me, as I need to run this on a matrix that will be more than 1,000,000 x 100, and running 1000 epochs there will take forever.


  • The issue is your choice of training data. Your data is of the form (x1, x2), but in all training examples x2 == x1 + 1. So you really only have a single input, but two weights plus a bias, leading to infinitely many solutions. Your function to be learned is basically 2 * x1 + 1. But since you have two weights, there are different ways to split this up, for example

    • w=(1.1, 0.9), b=0.1
    • w=(1.3, 0.7), b=0.3
    • etc., maybe you can already see the pattern.
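
To make the ambiguity concrete: substituting x2 = x1 + 1 into w1 * x1 + w2 * x2 + b gives (w1 + w2) * x1 + (w2 + b), so any parameters with w1 + w2 = 2 and w2 + b = 1 reproduce the targets exactly. The following is a minimal NumPy sketch that checks this on the question's data; the `predict` helper and the list of candidates are illustrative choices, not part of the original code.

```python
import numpy as np

# Training data from the question.
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]], dtype=float)
y = np.array([3, 5, 7, 9, 11], dtype=float)

def predict(w, b):
    """Prediction of a linear model with weights w and bias b (hypothetical helper)."""
    return X @ np.asarray(w, dtype=float) + b

# Every candidate satisfies w1 + w2 = 2 and w2 + b = 1.
# The last one is roughly what the question reports after training.
candidates = [
    ((1.0, 1.0), 0.0),
    ((1.1, 0.9), 0.1),
    ((1.3, 0.7), 0.3),
    ((1.15, 0.85), 0.15),
]

# Largest absolute prediction error for each candidate.
max_errors = [float(np.max(np.abs(predict(w, b) - y))) for w, b in candidates]

for (w, b), err in zip(candidates, max_errors):
    print(f"w={w}, b={b}: max abs error = {err:.2e}")
```

All candidates fit the data up to floating-point rounding, so the MSE loss cannot distinguish between them.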

    Since all of these solutions reach the same (zero) loss, none is "better" than another, and there is no reason why the optimizer should converge to the "obvious" one. Possible fixes:

    • Use a better selection of training data that does not have this issue.
    • Set use_bias=False in your Dense layer, forcing a bias of 0, in which case there is only one solution for w.
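
As a rough illustration of the second fix, here is a NumPy sketch rather than Keras code: solving the least-squares problem without an intercept column is the analogue of use_bias=False. It assumes the same five data points as in the question, and the variable names are illustrative.

```python
import numpy as np

# Training data from the question.
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]], dtype=float)
y = np.array([3, 5, 7, 9, 11], dtype=float)

# Least squares without an intercept column (the analogue of use_bias=False).
# lstsq returns (solution, residuals, rank, singular values).
w_no_bias, _, rank_no_bias, _ = np.linalg.lstsq(X, y, rcond=None)

print("weights without bias:", w_no_bias)
print("rank of X:", rank_no_bias, "for", X.shape[1], "columns")
```

Because the rank equals the number of columns, the solution is unique, and it is the (1, 1) the question expected. Whether dropping the bias is acceptable on the real data is a separate modelling decision.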

    If you want to read more: your data exhibits multicollinearity.
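
One possible way to spot this before training is sketched below with NumPy, again on the question's data. It looks at the correlation between the two input columns and at the rank of the design matrix once a column of ones (standing in for the bias) is appended; the variable names are illustrative.

```python
import numpy as np

# Input data from the question.
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]], dtype=float)

# Correlation between the two input columns; a magnitude of (almost) exactly 1
# means the columns are perfectly linearly related.
corr = np.corrcoef(X[:, 0], X[:, 1])[0, 1]

# Design matrix with a column of ones standing in for the bias term.
X_bias = np.column_stack([X, np.ones(len(X))])
rank = np.linalg.matrix_rank(X_bias)
singular_values = np.linalg.svd(X_bias, compute_uv=False)

print("correlation between x1 and x2:", corr)
print("rank:", rank, "of", X_bias.shape[1], "columns")
print("singular values:", singular_values)
```

A rank below the number of columns (equivalently, a smallest singular value that is numerically zero) means the parameters are not uniquely determined by the data.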