Tags: python, tensorflow, keras, linear-regression

Multiple Linear Regression with TensorFlow


I'm trying to perform a multiple linear regression with TensorFlow and compare the results with the statsmodels library.

I generated two random variables X1 and X2 (with fixed seeds, so that anyone can reproduce it) that will explain the Y variable. The X2 variable is completely useless for this regression: it is just noise on a large scale, so its coefficient should turn out not significant (p-value close to 1). In the end I should obtain a model that is basically: y_data = alpha + (0.25)x1 + (0.00)x2 + error.

I tried to adapt this code to my randomly generated data, but unfortunately it is not working at all. Below is my attempt:

import numpy as np
from scipy.stats import norm
import statsmodels.api as sm
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing
import datetime


# generating variables:
np.random.seed(1)
lin_x = np.arange(0, 200, 2)
y_data = lin_x / 4
n = np.shape(lin_x)
# adding noise:
rand1 = norm.rvs(loc=0, scale=5, size=n)
np.random.seed(2)
rand2 = norm.rvs(loc=0, scale=1000, size=n)
x1 = lin_x + rand1
x2 = rand2

# creating the X matrix: beta = (X'X)^-1 (X'y)
x_data = np.column_stack((x1, x2))
# adding a ones vector for the intercept:
x_data = sm.add_constant(x_data)

# MLR with statsmodels:
mod = sm.OLS(y_data,x_data)
LinReg = mod.fit()
print(LinReg.summary())



# MLR with TensorFlow:
normalizer = preprocessing.Normalization()
normalizer.adapt(x_data)

normalized_data = normalizer(x_data)

print(normalized_data)


model = tf.keras.Sequential([
  normalizer,
  layers.Dense(units=1)
])

model.compile(loss = tf.losses.MeanSquaredError(),
              optimizer = tf.keras.optimizers.SGD(
                learning_rate=0.06, momentum=0.0, nesterov=True, name="SGD",
              ))

log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

model.summary()
print('--------------')
weights = model.layers[0].get_weights()[0]
biases = model.layers[0].get_weights()[1]
print('--------------')

x_data_tf = tf.convert_to_tensor(x_data)
y_data_tf = tf.convert_to_tensor(y_data)
model.fit(y_data_tf,x_data_tf, epochs=1000, callbacks=[tensorboard_callback])

weights = model.layers[0].get_weights()[0]
biases = model.layers[0].get_weights()[1]

print("TensorFlow results: ")
print("weigths: ", weights)
print("biases: ", biases)

print(LinReg.summary())

How can I get the same coefficients with TensorFlow that I obtained with the statsmodels library?


Solution

  • The key issues with your code are the following:

    1. While it is necessary to add a column of ones to the feature matrix x_data before running the regression with statsmodels, this is not necessary with TensorFlow, because the Dense layer fits its own bias term (the intercept) by default. This means that you are passing 3 features to TensorFlow instead of 2, where the additional feature (the first column of x_data) is constant.
    2. You are normalizing x_data after the column of ones has already been added by x_data = sm.add_constant(x_data). Since a column of ones has zero variance, normalizing it divides by zero and produces a column of nan. This means that the first of the 3 features you are passing to TensorFlow is completely missing (i.e. it is always nan).
    3. statsmodels takes y first and then X as inputs, whereas model.fit takes X first and then y. This means that you have swapped the features and the target when running the regression in TensorFlow. A minimal patch applying all three fixes follows this list.
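    For reference, here is a minimal sketch of your script with those three fixes applied. Two assumptions on my part: I standardize the two raw features by hand instead of using the Normalization layer, so that both libraries see the same inputs and the coefficients are directly comparable, and I use the plain SGD settings from my complete example further below instead of your momentum/Nesterov configuration.

    import numpy as np
    from scipy.stats import norm
    import statsmodels.api as sm
    import tensorflow as tf
    from tensorflow.keras import layers
    
    np.random.seed(1)
    lin_x = np.arange(0, 200, 2)
    y_data = lin_x / 4
    rand1 = norm.rvs(loc=0, scale=5, size=np.shape(lin_x))
    np.random.seed(2)
    rand2 = norm.rvs(loc=0, scale=1000, size=np.shape(lin_x))
    x1 = lin_x + rand1
    x2 = rand2
    
    # fix 2: standardize the 2 raw features (no constant column yet)
    x_data = np.column_stack((x1, x2))
    x_data = (x_data - x_data.mean(axis=0)) / x_data.std(axis=0, ddof=1)
    
    # statsmodels needs the explicit intercept column ...
    LinReg = sm.OLS(y_data, sm.add_constant(x_data)).fit()
    
    # fix 1: ... while the Dense layer fits its own bias (intercept)
    model = tf.keras.Sequential([layers.Dense(units=1)])
    model.compile(loss=tf.losses.MeanSquaredError(),
                  optimizer=tf.keras.optimizers.SGD(learning_rate=0.01))
    
    # fix 3: features first, then the target
    model.fit(x_data, y_data, epochs=1000, verbose=0)
    
    print(model.layers[0].get_weights()[1])            # bias ~ intercept
    print(model.layers[0].get_weights()[0].flatten())  # weights ~ slopes
    print(LinReg.params)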

    I included a complete example below.

    import numpy as np
    import statsmodels.api as sm
    import tensorflow as tf
    
    np.random.seed(0)
    
    N = 500    # number of samples
    K = 2      # number of features
    
    # generate the features
    m = np.array([0, 0])     # means
    s = np.array([5, 1000])  # variances
    X = np.random.multivariate_normal(m * np.ones(K), s * np.eye(K), N)
    
    # generate the target
    b = 0.5                   # bias
    w = np.array([0.107, 0])  # weights
    y = b + np.dot(X, w) + np.random.normal(0, 1, N)
    
    # normalize the features
    X = (X - np.mean(X, axis=0)) / np.std(X, axis=0, ddof=1)
    
    # run a linear regression with statsmodels
    reg = sm.OLS(y, sm.add_constant(X)).fit()
    
    # run a linear regression with tensorflow
    model = tf.keras.Sequential([
      tf.keras.layers.Dense(units=1)
    ])
    
    model.compile(loss=tf.losses.MeanSquaredError(),
                  optimizer=tf.keras.optimizers.SGD(learning_rate=0.01))
    model.fit(X, y, epochs=1000, verbose=0)
    
    bias = model.layers[0].get_weights()[1]
    weights = model.layers[0].get_weights()[0].flatten()
    
    # compare the parameters
    print('Statsmodels parameters:')
    print(np.round(reg.params, 3))
    # [0.523 0.25 0.063]
    
    print('Tensorflow parameters:')
    print(np.round(np.append(bias, weights), 3))
    # [0.528 0.25 0.066]
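    
    To double-check both sets of estimates against the closed-form OLS solution beta = (X'X)^-1 X'y (the formula in the comment of your original script), here is a short sketch reusing the X and y from the example above; np.linalg.lstsq solves the same least-squares problem without forming the inverse explicitly:
    
    # closed-form check: solve min ||Xc @ beta - y||^2 directly
    Xc = sm.add_constant(X)  # same intercept column as in the statsmodels fit
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    
    print('Closed-form parameters:')
    print(np.round(beta, 3))
    # identical to reg.params up to floating-point precision,
    # and close to the TensorFlow estimates above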