Tags: python, pandas, machine-learning, regression, sklearn-pandas

Multiple linear regression house price R² score problem


I have sample house price data and a simple script:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

df = pd.read_csv('house_price_4.csv')

# 'Area' is stored as strings with thousands separators, e.g. "1,200"
df['Area'] = df['Area'].str.replace(',', '').astype(float)
df = df.dropna()

# Encode the categorical feature 'Address' as integer codes
df['Address'] = df['Address'].astype('category').cat.codes

# Convert the boolean features to 0/1
df['Parking'] = df['Parking'].astype(int)
df['Warehouse'] = df['Warehouse'].astype(int)
df['Elevator'] = df['Elevator'].astype(int)

X = df.drop(columns=['Price(USD)', 'Price'])
y = df['Price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

r_squared = r2_score(y_test, y_pred)
print(f'R^2 Score: {r_squared:.4f}')

                                                                  

My R² score is very low: 0.34.

How can I get a higher R² score?

This is my sample data: https://drive.google.com/file/d/14Se90XbGJivftq3_VrtgRSAlkCplduVX/view?usp=sharing


Solution

  • Instead of linear regression, you can try other models to check whether your data can be modelled at all. And by the way, R² is not the biggest problem with using linear regression here: the residuals of the linear model clearly hint at heteroscedasticity (see the residual-plot sketch after the comparison below). Check the comparison here:

    import matplotlib.pyplot as plt
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.metrics import mean_squared_error, r2_score

    fig, axs = plt.subplots(nrows=1, ncols=2)  # one panel per model

    # Linear regression
    lrModel = LinearRegression()
    lrModel.fit(X_train, y_train)
    lryPred = lrModel.predict(X_test)
    lrRMSE = mean_squared_error(y_test, lryPred) ** 0.5  # RMSE
    lrR2 = r2_score(y_test, lryPred)
    axs[0].scatter(lryPred, y_test)  # predicted vs actual
    axs[0].set_title(f"Linear Regression\nR² = {lrR2:.2f}; RMSE = {lrRMSE:.0f}")

    # Decision tree
    dtModel = DecisionTreeRegressor(random_state=42)
    dtModel.fit(X_train, y_train)
    dtyPred = dtModel.predict(X_test)
    dtRMSE = mean_squared_error(y_test, dtyPred) ** 0.5  # RMSE
    dtR2 = r2_score(y_test, dtyPred)
    axs[1].scatter(dtyPred, y_test)  # predicted vs actual
    axs[1].set_title(f"Decision Tree Regressor\nR² = {dtR2:.2f}; RMSE = {dtRMSE:.0f}")

    plt.show()
    

    The results are like this:

    [Figure: actual vs. predicted scatter plots for both models]

    The choice of linear regression was wrong from the beginning: some predictions even go negative, which makes no sense for house prices. Use a decision tree or a random forest (see the sketch below); they should give similar fits.
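
    To see the heteroscedasticity mentioned above, plot the residuals against the predicted values. This is a minimal sketch, assuming the fitted lrModel and the X_test/y_test split from the comparison code above:

    import matplotlib.pyplot as plt

    # Residuals of the linear model: the errors should look like random
    # noise around zero; a funnel shape indicates heteroscedasticity.
    residuals = y_test - lryPred
    plt.scatter(lryPred, residuals)
    plt.axhline(0, color='red', linestyle='--')
    plt.xlabel('Predicted price')
    plt.ylabel('Residual (actual - predicted)')
    plt.title('Linear regression residual plot')
    plt.show()

    If the spread of the residuals grows with the predicted price, the constant-variance assumption of ordinary least squares is violated, which is consistent with the low R².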
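
    A random forest drops in the same way as the decision tree above. This is a minimal sketch, again assuming the X_train/y_train/X_test/y_test split from the question; n_estimators=100 is just the scikit-learn default, not a tuned value:

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score

    # Random forest: an ensemble of decision trees, usually less prone
    # to overfitting than a single tree.
    rfModel = RandomForestRegressor(n_estimators=100, random_state=42)
    rfModel.fit(X_train, y_train)
    rfyPred = rfModel.predict(X_test)
    print(f"Random Forest R² = {r2_score(y_test, rfyPred):.2f}")

    Unlike the linear model, tree-based models also cannot predict below the minimum price seen in training, so the negative predictions disappear.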