Tags: python, pandas, machine-learning, regression, sklearn-pandas

Multiple linear regression house price R² score problem


I have sample house price data and a simple script:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

df = pd.read_csv('house_price_4.csv')

# 'Area' is stored as strings with thousands separators, e.g. "1,200"
df['Area'] = df['Area'].str.replace(',', '').astype(float)
df = df.dropna()

# Encode the categorical feature 'Address' as integer codes
df['Address'] = df['Address'].astype('category').cat.codes

# Convert the boolean features to 0/1
df['Parking'] = df['Parking'].astype(int)
df['Warehouse'] = df['Warehouse'].astype(int)
df['Elevator'] = df['Elevator'].astype(int)

X = df.drop(columns=['Price(USD)', 'Price'])
y = df['Price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

r_squared = r2_score(y_test, y_pred)
print(f'R^2 Score: {r_squared:.4f}')

                                                                  

My R² score is very low: 0.34.

How can I get a higher R² score?

This is my sample data: https://drive.google.com/file/d/14Se90XbGJivftq3_VrtgRSAlkCplduVX/view?usp=sharing


Solution

  • Instead of linear regression, you can try other models to check whether your data can be modelled at all. And by the way, R² is not the biggest problem with using linear regression here: the residuals of the linear model clearly hint at heteroscedasticity (see the residual-plot sketch after the comparison below). Check the comparison here:

    import matplotlib.pyplot as plt
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.metrics import mean_squared_error, r2_score

    fig, axs = plt.subplots(nrows=1, ncols=2)  # one panel per model

    # Linear regression
    lrModel = LinearRegression()
    lrModel.fit(X_train, y_train)
    lryPred = lrModel.predict(X_test)
    lrRMSE = mean_squared_error(y_test, lryPred) ** 0.5  # RMSE
    lrR2 = r2_score(y_test, lryPred)
    axs[0].scatter(lryPred, y_test)  # predicted vs actual
    axs[0].set_title(f"Linear Regression\nR² = {lrR2:.2f}; RMSE = {lrRMSE:.0f}")

    # Decision tree
    dtModel = DecisionTreeRegressor(random_state=42)
    dtModel.fit(X_train, y_train)
    dtyPred = dtModel.predict(X_test)
    dtRMSE = mean_squared_error(y_test, dtyPred) ** 0.5  # RMSE
    dtR2 = r2_score(y_test, dtyPred)
    axs[1].scatter(dtyPred, y_test)  # predicted vs actual
    axs[1].set_title(f"Decision Tree Regressor\nR² = {dtR2:.2f}; RMSE = {dtRMSE:.0f}")

    plt.show()
    

    The results are like this:

    [Figure: actual vs. predicted scatter plots for both models]

    The choice of linear regression was wrong from the beginning: some predictions even go negative, which makes no sense for house prices. Use a decision tree or a random forest (see the sketch below); they should give similar fits.
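
    To see the heteroscedasticity mentioned above, plot the residuals against the predicted values. This is a minimal sketch, assuming the fitted lrModel and the X_test/y_test split from the comparison code above:

    import matplotlib.pyplot as plt

    # Residuals of the linear model: the errors should look like random
    # noise around zero; a funnel shape indicates heteroscedasticity.
    residuals = y_test - lryPred
    plt.scatter(lryPred, residuals)
    plt.axhline(0, color='red', linestyle='--')
    plt.xlabel('Predicted price')
    plt.ylabel('Residual (actual - predicted)')
    plt.title('Linear regression residual plot')
    plt.show()

    If the spread of the residuals grows with the predicted price, the constant-variance assumption of ordinary least squares is violated, which is consistent with the low R².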
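
    A random forest drops in the same way as the decision tree above. This is a minimal sketch, again assuming the X_train/y_train/X_test/y_test split from the question; n_estimators=100 is just the scikit-learn default, not a tuned value:

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score

    # Random forest: an ensemble of decision trees, usually less prone
    # to overfitting than a single tree.
    rfModel = RandomForestRegressor(n_estimators=100, random_state=42)
    rfModel.fit(X_train, y_train)
    rfyPred = rfModel.predict(X_test)
    print(f"Random Forest R² = {r2_score(y_test, rfyPred):.2f}")

    Unlike the linear model, tree-based models also cannot predict below the minimum price seen in training, so the negative predictions disappear.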