Tags: python, machine-learning, linear-regression, polynomial-math

How should I define y_true when evaluating polynomial regression with R² and MSE in Python?


import numpy as np
from sklearn import linear_model, metrics
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures

train_x_p = np.asanyarray(train[['FUELCONSUMPTION_COMB_MPG']])
poly = PolynomialFeatures(degree=3)
train_x_poly = poly.fit_transform(train_x_p)
regr = linear_model.LinearRegression()
regr.fit(train_x_poly, train_y)
print('Coefficients: ', regr.coef_)
print('Intercept: ', regr.intercept_)

test_x_poly = poly.fit_transform(test_x)
test_y_poly1 = np.asanyarray(test[['CO2EMISSIONS']])  # I'm not sure, especially about this line
test_y_hat_poly1 = regr.predict(test_x_poly)

mse = metrics.mean_squared_error(test_y_poly1, test_y_hat_poly1)
r2 = r2_score(test_y_poly1, test_y_hat_poly1)
print('MSE & R2 for polynomial linear regression (FUELCONSUMPTION_COMB_MPG):')
print('MSE: ', mse)
print('r2-sq: ', r2)

Also, what made me feel it's incorrect is the MSE result. Should I transform the test y to polynomial features as well, and if so, how can I do it?


Solution

  • No, you should not transform your y_true values. What PolynomialFeatures does is take the predictors x_1, x_2, ..., x_p and apply a polynomial transformation of the chosen degree to each of them.

    If you have 2 predictors x_1 and x_2 and apply a polynomial transformation of 3rd degree, you end up with a problem of the form:

    y = b_0 + b_1 * x_1 + b_2 * x_1^2 + b_3 * x_1^3 + b_4 * x_2 + b_5 * x_2^2 + b_6 * x_2^3

    (by default, PolynomialFeatures also includes the interaction terms x_1 * x_2, x_1^2 * x_2 and x_1 * x_2^2, but the idea is the same).

    You want to do this when there is a non-linear relationship between the predictors and the response but you still want to use a linear model to fit the data. y_true stays the same whether you are using polynomial features or not (and the same holds for most other regression models).
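
    To make the expansion concrete, here is a minimal sketch (with made-up numbers, purely for illustration) that prints the features PolynomialFeatures generates for two predictors; get_feature_names_out is available in recent scikit-learn versions:

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures

    X = np.array([[2.0, 3.0]])            # one sample with x_1 = 2 and x_2 = 3
    poly = PolynomialFeatures(degree=3)
    X_poly = poly.fit_transform(X)

    print(poly.get_feature_names_out())   # ['1' 'x0' 'x1' 'x0^2' 'x0 x1' ... 'x1^3']
    print(X_poly)                         # the expanded feature values for the sample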

    Your code is almost fine, except for one issue: you are calling fit_transform on the test data, which is something you should never do. You have already fitted the PolynomialFeatures object on the training data; all you need to do is call its transform method on your test data.

    test_x_poly = poly.transform(test_x)
    

    Here is an example of what it looks like when you use polynomial features and there is a polynomial relationship between the predictor and the response.

    1. get the data (I will just generate some)
    import numpy as np

    X = np.random.randint(-100, 100, (100, 1))
    y = X ** 2 + np.random.normal(size=(100, 1))
    
    2. train/test split
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    
    3. fit the polynomial features on the train data
    from sklearn.preprocessing import PolynomialFeatures

    poly_features = PolynomialFeatures(degree=2)
    X_train_poly = poly_features.fit_transform(X_train)  # fit and transform in one step
    
    4. fit the linear regression model on the train data
    from sklearn.linear_model import LinearRegression

    reg = LinearRegression()
    reg.fit(X_train_poly, y_train)
    
    5. visualize the regression line (only for illustration purposes - only applicable when there is a single predictor)
    import matplotlib.pyplot as plt

    reg_line_x = poly_features.transform(np.linspace(-100, 100, 1000).reshape((-1, 1)))
    reg_line_y = reg.predict(reg_line_x)
    plt.scatter(X_train_poly[:, 1].ravel(), y_train)  # column 1 holds the original x values
    plt.plot(reg_line_x[:, 1].ravel(), reg_line_y, c="red", label="regression line")
    plt.legend()
    plt.show()
    

    (plot: scatter of the training data with the fitted regression line in red)

    6. transform the X_test data and make the prediction
    # do NOT call fit_transform here
    X_test_poly = poly_features.transform(X_test)
    y_pred = reg.predict(X_test_poly)
    

    There is also a more convenient way of doing this by building a pipeline that handles everything (in your case, the polynomial transformation and the regression) so that you don't have to perform each individual step manually.

    from sklearn.pipeline import Pipeline
    from sklearn.metrics import mean_squared_error, r2_score

    # both steps were already fitted above; otherwise call pipe.fit(X_train, y_train) first
    pipe = Pipeline([
            ("poly_features", poly_features),
            ("regression", reg)
    ])

    y_pred = pipe.predict(X_test)

    print(f"r2 : {r2_score(y_test, y_pred)}")
    print(f"mse: {mean_squared_error(y_test, y_pred)}")

    r2 : 0.9999997923643911

    mse: 1.4848830127345198
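
    As a side note: the pipeline above works only because its two steps were already fitted. The more common pattern, shown in this minimal sketch (reusing the X_train/X_test split from above), is to build the pipeline from fresh, unfitted estimators and fit it end to end:

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression

    pipe = Pipeline([
            ("poly_features", PolynomialFeatures(degree=2)),
            ("regression", LinearRegression())
    ])

    pipe.fit(X_train, y_train)   # fits the transformer and the regressor in one call
    y_pred = pipe.predict(X_test)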


    Note that poor R squared or MSE values do not necessarily mean that your code is wrong. It might be that your data is not suited for the task, or that you need a different degree of polynomial transformation - you might be either underfitting or overfitting the training data, etc.
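
    If you want to check for under- or overfitting, a minimal sketch (again reusing the X_train/X_test split from the example above) is to compare train and test scores across several candidate degrees:

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score

    for degree in (1, 2, 3, 5):
        model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
        model.fit(X_train, y_train)
        r2_train = r2_score(y_train, model.predict(X_train))  # fit on the training data
        r2_test = r2_score(y_test, model.predict(X_test))     # generalization to unseen data
        print(f"degree={degree}: train r2={r2_train:.3f}, test r2={r2_test:.3f}")

    A large gap between the train and test scores points to overfitting, while poor scores on both point to underfitting.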