I have the following dataset: https://raw.githubusercontent.com/Joffreybvn/real-estate-data-analysis/master/data/clean/belgium_real_estate.csv
I want to predict the price column from the other features, i.e. predict the house price from the square meters, number of rooms, postal code, etc.
So I did the following:
Load data:
from azureml.core import Workspace, Dataset
import pandas as pd

workspace = Workspace(subscription_id, resource_group, workspace_name)
dataset = Dataset.get_by_name(workspace, name='BelgiumRealEstate')
data = dataset.to_pandas_dataframe()
data.sample(5)
Column1 postal_code city_name type_of_property price number_of_rooms house_area fully_equipped_kitchen open_fire terrace garden surface_of_the_land number_of_facades swimming_pool state_of_the_building lattitude longitude province region
33580 33580 9850 Landegem 1 380000 3 127 1 0 1 0 0 0 0 as new 3.588809 51.054637 Flandre-Orientale Flandre
11576 11576 9000 Gent 1 319000 2 89 1 0 1 0 0 2 0 as new 3.714155 51.039713 Flandre-Orientale Flandre
12830 12830 3300 Bost 0 170000 3 140 1 0 1 1 160 2 0 to renovate 4.933924 50.784632 Brabant flamand Flandre
20736 20736 6880 Cugnon 0 270000 4 218 0 0 0 0 3000 4 0 unknown 5.203308 49.802043 Luxembourg Wallonie
11416 11416 9000 Gent 0 875000 6 232 1 0 0 1 0 2 0 good 3.714155 51.039713 Flandre-Orientale Flandre
I one-hot encoded the categorical features (city, province, region, and state of the building):
one_hot_state_of_the_building = pd.get_dummies(data.state_of_the_building)
one_hot_city = pd.get_dummies(data.city_name, prefix='city')
one_hot_province = pd.get_dummies(data.province, prefix='province')
one_hot_region = pd.get_dummies(data.region, prefix='region')
Then I added those columns to the pandas DataFrame:
# Removing the original categorical features
data.drop(['city_name', 'state_of_the_building', 'province', 'region'], axis=1, inplace=True)
# Merging the one-hot encoded features with our dataset 'data'
data = pd.concat([data, one_hot_city, one_hot_state_of_the_building, one_hot_province, one_hot_region], axis=1)
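For reference, the encode, drop, and concat steps above can also be done in a single pd.get_dummies call. This is only a sketch on a tiny made-up frame (the rows and the `categorical` list are illustrative, not the real dataset), and it names the indicator columns slightly differently: every one gets its full source column name as prefix, e.g. city_name_Gent rather than city_Gent.

```python
import pandas as pd

# Tiny made-up frame with the same categorical columns as the dataset.
data = pd.DataFrame({
    "price": [380000, 170000, 270000],
    "city_name": ["Landegem", "Bost", "Cugnon"],
    "state_of_the_building": ["as new", "to renovate", "unknown"],
    "province": ["Flandre-Orientale", "Brabant flamand", "Luxembourg"],
    "region": ["Flandre", "Flandre", "Wallonie"],
})

categorical = ["city_name", "state_of_the_building", "province", "region"]

# get_dummies removes the listed columns and appends one indicator column
# per category, prefixed with the original column name.
encoded = pd.get_dummies(data, columns=categorical)

print(encoded.columns.tolist())
```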
I separate the price (the target) from the features:
x = data.drop('price', axis=1)
y = data.price
Then I do the train/test split:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.3)
Then I train:
import lightgbm as lgb

# Rebuild the feature frame from x's own columns (data.columns still contains 'price')
x_df = pd.DataFrame(x, columns=x.columns)
x_train, x_test, y_train, y_test = train_test_split(x_df, y, test_size=0.15)
# Converting the data into proper LGB Dataset format
d_train = lgb.Dataset(x_train, label=y_train)
#Declaring the parameters
params = {
'task': 'train',
'boosting': 'gbdt',
'objective': 'regression',
'num_leaves': 10,
'learning_rate': 0.05,
'metric': {'l2','l1'},
'verbose': -1
}
#model creation and training
clf=lgb.train(params,d_train,10000)
#model prediction on X_test
y_pred=clf.predict(x_test)
#using RMSE error metric
mean_squared_error(y_pred,y_test)
However, the RMSE is 6053845952.2186775,
which seems like a huge number.
I am not sure what I am doing wrong here.
I assume you are using sklearn.metrics.mean_squared_error, which returns the MSE without taking the square root. Then 6053845952 ** 0.5 ≈ 77806, which seems to me a reasonable RMSE for the quoted prices (e.g. it corresponds to less than 10% off for a price of 875000).
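To make the difference concrete, here is a minimal standard-library sketch. The helpers `mse` and `rmse` and the sample predictions are made up for illustration; with scikit-learn you should get the same RMSE from np.sqrt(mean_squared_error(y_test, y_pred)).

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: average squared residual (units: price squared)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root-mean-square error: square root of the MSE (units: same as price)."""
    return math.sqrt(mse(y_true, y_pred))

# Hypothetical prices and predictions, for illustration only.
y_true = [380000, 319000, 170000, 270000, 875000]
y_pred = [350000, 340000, 200000, 250000, 800000]

print(mse(y_true, y_pred))    # in squared price units, hence very large
print(rmse(y_true, y_pred))   # in price units, directly comparable to prices

# The value reported in the question, converted from MSE to RMSE.
reported_mse = 6053845952.2186775
reported_rmse = math.sqrt(reported_mse)
print(round(reported_rmse))       # 77806
print(reported_rmse / 875000)     # relative error against a price of 875000
```

Because the MSE is in squared price units, it is only meaningful to compare it with the prices after taking the root.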