
Different RMSE when training/testing my polynomial regression before/after standardizing


I am in the process of building a regression model that will eventually be used by other users. The model predicts flower temperature from multiple atmospheric variables such as air temperature, humidity, solar radiation, wind, etc.

After much doodling around, I've noticed that a 2nd-degree polynomial regression through sklearn gives a good RMSE for both my training and testing data. However, since there are over 36 coefficients, collinearity occurs, and according to a comment on this post: https://stats.stackexchange.com/questions/29781/when-conducting-multiple-regression-when-should-you-center-your-predictor-varia, collinearity would disturb the betas, so the RMSE I am getting would be improper.

I've heard that perhaps I should standardize in order to remove the collinearity, or use an orthogonal decomposition, but I don't know which would be better. In any case, I've tried standardizing my X variables, and when I compute the RMSE for my training and testing data, I get the same RMSE for the training data but a different RMSE for the testing data.

Here is the code:

import pandas as pd
import numpy as np 
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn import metrics

def OpenFile(ThePath):
    path = Location + ThePath
    Prepared_df = pd.read_csv(path, sep=',', encoding='utf-8')
    Prepared_df = Prepared_df.loc[:, ~Prepared_df.columns.str.contains('^Unnamed')]
    return(Prepared_df)

def EvaluateRegression(Test_data,Predict_data):
    MAE = np.round(metrics.mean_absolute_error(Test_data, Predict_data),3)
    MSE = np.round(metrics.mean_squared_error(Test_data, Predict_data),3)
    RMSE = np.round(np.sqrt(metrics.mean_squared_error(Test_data, Predict_data)),3)
    print('Mean absolute error :',MAE)
    print('Mean square error :',MSE)
    print('RMSE :',RMSE)
    return MAE,MSE,RMSE

#Read files ------------------------------------------------------------------------------------------------------------
Location = 'C:\\Users\...'

#Training data
File_Station_day = 'Flower_Station_data_day.csv' #X training data
File_TD = 'Flower_Y_data_day.csv' #Y training data
Chosen_Air = OpenFile(File_Station_day)
Day_TC = OpenFile(File_TD)

#Testing data 
File_Fluke_Station= 'Fluke_Station_data.csv' #X testing data
File_Fluke = 'Flower_Fluke_data.csv' #Y testing data
Chosen_Air_Fluke = OpenFile(File_Fluke_Station) #X testing data
Fluke_data = OpenFile(File_Fluke)               #Y testing data

#Prepare data --------------------------------------------------------------------------------------------------------
y_train = Day_TC
y_test = Fluke_data
#Get the desired atmospheric variables
Air_cols = ['MAXTemp_data', 'MINTemp_data', 'Humidity', 'Precipitation', 'Pression', 'Arti_InSW', 'sin_time'] #Specify the desired atmospheric variables
X_train = Chosen_Air[Air_cols]  
X_test = Chosen_Air_Fluke[Air_cols]

#If not standardizing
poly = PolynomialFeatures(degree=2)
linear_poly = LinearRegression()
X_train_rdy = poly.fit_transform(X_train)
linear_poly.fit(X_train_rdy,y_train)
X_test_rdy = poly.fit_transform(X_test)

Input_model= linear_poly
print('Regression: For train')
MAE, MSE, RMSE = EvaluateRegression(y_train, Input_model.predict(X_train_rdy))
#For testing data
print('Regression: For test')
MAE, MSE, RMSE = EvaluateRegression(y_test,  Input_model.predict(X_test_rdy))

#Output:
Regression: For train
Mean absolute error : 0.391
Mean square error : 0.256
RMSE : 0.506
Regression: For test
Mean absolute error : 0.652
Mean square error : 0.569
RMSE : 0.754

#If standardizing
std = StandardScaler()
X_train_std = pd.DataFrame(std.fit_transform(X_train),columns = Air_cols)
X_test_std = pd.DataFrame(std.fit_transform(X_test),columns = Air_cols)
poly = PolynomialFeatures(degree=2)
linear_poly_std = LinearRegression()
X_train_std_rdy = poly.fit_transform(X_train_std)
linear_poly_std.fit(X_train_std_rdy,y_train)
X_test_std_rdy = poly.fit_transform(X_test_std)

Input_model= linear_poly_std
print('Regression: For train')
MAE, MSE, RMSE = EvaluateRegression(y_train, Input_model.predict(X_train_std_rdy))
#For testing data
print('Regression: For test')
MAE, MSE, RMSE = EvaluateRegression(y_test,  Input_model.predict(X_test_std_rdy))

#Output:
Regression: For train
Mean absolute error : 0.391
Mean square error : 0.256
RMSE : 0.506
Regression: For test
Mean absolute error : 10.901
Mean square error : 304.53
RMSE : 17.451

Why is the RMSE I get for the standardized testing data so different from the non-standardized one? Perhaps the way I'm doing this is no good at all? Please let me know if I should attach the files to the post.

Thank you for your time!


Solution

  • At the very least, you should not call fit_transform twice. Treat the scaler and the polynomial features the same way as the regression model: fit once on the training data, then only transform the test data. Right now you are re-fitting the scaler on the test set (which likely gives it a different mean/std), yet applying the same regression model.
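A minimal sketch of that pattern, using tiny made-up arrays in place of the real CSV data:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical stand-in arrays for X_train / X_test
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[1.5, 15.0]])

std = StandardScaler()
X_train_std = std.fit_transform(X_train)  # learn mean/std from training data only
X_test_std = std.transform(X_test)        # reuse the training statistics

poly = PolynomialFeatures(degree=2)
X_train_rdy = poly.fit_transform(X_train_std)
X_test_rdy = poly.transform(X_test_std)   # transform, never fit_transform, on test data
```

With this, the test set is projected through exactly the same scaling and feature mapping the model was trained on, so train and test RMSE are comparable.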

    Side note: your code is rather hard to read/debug, and that makes simple typos/mistakes easy. I suggest wrapping the training logic inside a single function, and optionally using sklearn pipelines. That would make toggling the scaler a matter of [un]commenting a single line, literally.
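A rough sketch of the pipeline idea, with hypothetical toy data standing in for the CSV files:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression

# Toy data as a stand-in for the real atmospheric/flower data
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 3))
y_train = X_train @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
X_test = rng.normal(size=(10, 3))

model = Pipeline([
    ('scale', StandardScaler()),          # comment this line out to test without scaling
    ('poly', PolynomialFeatures(degree=2)),
    ('reg', LinearRegression()),
])
model.fit(X_train, y_train)     # scaler and poly are fitted on the training data only
y_pred = model.predict(X_test)  # test data is transformed with the training-time fit
```

The pipeline guarantees the fit-on-train/transform-on-test discipline automatically, so the scaler-refit mistake above becomes impossible.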