My dataset has gotten extremely large, so I am unable to use typical OLS methods to calculate my linear regression estimators, and I wanted to use a standard optimizer instead (Adam seems to be a good fit).
I understand that I can do this fairly simply with Keras; see the example below.
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
# Define the model
def build_model(input_dim):
    model = Sequential()
    # Using a smaller standard deviation for the normal initializer
    model.add(Dense(1, input_dim=input_dim,
                    kernel_initializer=tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.05),
                    activation='linear'))
    # Increased learning rate
    optimizer = Adam(learning_rate=0.1)
    model.compile(loss='mse', optimizer=optimizer, metrics=['mse'])
    return model
# Example usage:
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]], dtype=float)
y = np.array([3, 5, 7, 9, 11], dtype=float)
# Build and train the model
model = build_model(input_dim=2)
model.fit(X, y, epochs=1000, verbose=0, batch_size=5) # Reduced number of epochs and batch size
# Make predictions
predictions = model.predict(X)
print("Predictions:", predictions.flatten())
# Output the model summary to check the structure
model.summary()
model.get_weights()
However, my problem is that even after 1000 epochs it still doesn't converge towards the obvious (1, 1) weights; it's around 1.15 / 0.85.
Is Adam not a good optimizer for this example, or am I doing something wrong? I remember playing around with SGD some time ago, which I recall converging extremely quickly on linear regression problems. It's a bit concerning for me, as I need to run this on a matrix that will be more than 1,000,000 x 100, and running 1000 epochs there will take forever.
The issue is your choice of training data. Your data is of the form (x1, x2), but in every training example x2 == x1 + 1. So you really only have a single input, yet two weights plus a bias, leading to infinitely many solutions. The function to be learned is basically 2 * x1 + 1, and since x2 = x1 + 1, your model computes w1*x1 + w2*x2 + b = (w1 + w2)*x1 + (w2 + b). Any combination with w1 + w2 = 2 and w2 + b = 1 fits the data perfectly, so there are many ways to split this up, for example
w=(1, 1), b=0
w=(1.5, 0.5), b=0.5
w=(1.15, 0.85), b=0.15 (roughly the kind of solution your run found)
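As a quick sanity check, here is a minimal sketch in plain NumPy (independent of Keras) showing that two of these very different parameter sets reproduce y exactly:
import numpy as np

X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]], dtype=float)
y = np.array([3, 5, 7, 9, 11], dtype=float)

# Any (w, b) with w1 + w2 == 2 and w2 + b == 1 gives a perfect fit
for w, b in [(np.array([1.0, 1.0]), 0.0), (np.array([1.5, 0.5]), 0.5)]:
    preds = X @ w + b
    print(w, b, np.allclose(preds, y))  # prints True for both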
Since one solution is not "better" than another, there is no reason why it should converge to the "obvious" one. A possible fix: pass use_bias=False to your Dense layer, forcing the bias to 0, in which case there is only one solution for w, namely (1, 1); see the sketch below.
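A minimal sketch of that change, assuming the same setup as in your question (the helper name build_model_no_bias is just for illustration):
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

def build_model_no_bias(input_dim):
    model = Sequential()
    # use_bias=False removes the intercept, so the only exact fit is w = (1, 1)
    model.add(Dense(1, input_dim=input_dim, use_bias=False,
                    kernel_initializer=tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.05),
                    activation='linear'))
    model.compile(loss='mse', optimizer=Adam(learning_rate=0.1), metrics=['mse'])
    return model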
If you want to read more: your data exhibits multicollinearity.
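You can see the multicollinearity directly by checking the rank of the design matrix once a bias column is appended (again a small NumPy sketch, nothing Keras-specific):
import numpy as np

X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]], dtype=float)
X_aug = np.hstack([X, np.ones((5, 1))])  # append the bias column

# Rank is 2 instead of 3 because x2 = x1 + 1 is a linear combination of the
# first column and the bias column, so OLS has no unique solution either
print(np.linalg.matrix_rank(X_aug))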