Tags: regression, random-forest, data-analysis, non-linear-regression, regression-testing

Why does random forest regression return a very bad result?


I'm trying to use RandomForestRegressor() in scikit-learn to model some data. After processing my raw data, the data I pass to RandomForestRegressor() looks like this:

[Image: sample rows of the processed data]

This is only a small part of my data. In total, there are around 6000 records.

Note that the first column is the DatetimeIndex of my DataFrame 'final_data', which contains all the data. In addition, the values in column 4 were originally strings; I converted them to numbers with a map function.
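For reference, a string-to-number conversion of that kind can be sketched as below. The column name 'col4' and the labels in the dictionary are hypothetical, since the real values of column 4 are not shown here.

```python
import pandas as pd

# Hypothetical labels; the actual strings in column 4 are not shown in the post.
category_codes = {"low": 0, "medium": 1, "high": 2}

# Small made-up frame standing in for final_data.
df = pd.DataFrame({"col4": ["low", "high", "medium", "low"]})

# Series.map looks each string up in the dictionary and returns its code.
df["col4"] = df["col4"].map(category_codes)
print(df)
```

Any string that is missing from the dictionary becomes NaN after map, so it is worth checking the column for missing values before fitting.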

import pandas as pd
# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Summer data: two separate date ranges
S_dataset1 = final_data[(final_data.index >= pd.to_datetime('20160403')) &
                        (final_data.index <= pd.to_datetime('20161002'))]

S_dataset2 = final_data[(final_data.index >= pd.to_datetime('20170403')) &
                        (final_data.index <= pd.to_datetime('20170901'))]

# Winter data: the range between the two summer periods
W_dataset = final_data[(final_data.index >= pd.to_datetime('20161002')) &
                       (final_data.index <= pd.to_datetime('20170403'))]

S_dataset = pd.concat([S_dataset1, S_dataset2])

# Features for the winter model: the first 8 columns plus 'col20'
A = W_dataset.iloc[:, :8]
B = W_dataset.loc[:, 'col20']
W_data = pd.concat([A, B], axis=1)
X = W_data.values
y = W_dataset['col9'].values  # target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1)

# The default criterion is mean squared error (named 'mse' in old releases
# and 'squared_error' since scikit-learn 1.0), so it is not passed explicitly.
forest = RandomForestRegressor(n_estimators=1000, random_state=1, n_jobs=-1)
forest.fit(X_train, y_train)
y_train_pred = forest.predict(X_train)
y_test_pred = forest.predict(X_test)
print('R^2 train: %.3f, test: %.3f' % (r2_score(y_train, y_train_pred),
                                       r2_score(y_test, y_test_pred)))

The code above is what I use to predict col9. I split final_data into two seasons, which I hoped would make the prediction more accurate. However, the result is very bad: the R2 score on the training set is around 0.9, but on the test set it is only around 0.25. I really don't know why the result is so bad. Could someone tell me what I did wrong and how I can improve my model? Many thanks!


Solution

  • I think the problem was that I didn't account for the effect of the datetime on the prediction. After converting the DatetimeIndex to numerical values and passing them to my model as an additional input, I got quite a good result: the R2 score is around 0.95-0.98.
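The answer does not say which numerical form of the index was used, so the following is only a minimal sketch of one way to do the conversion, assuming a timezone-naive DatetimeIndex. The helper name add_datetime_features, the new column names, and the small demo frame are illustrative and not part of the original code.

```python
import pandas as pd


def add_datetime_features(df):
    """Return a copy of df with numeric columns derived from its DatetimeIndex."""
    out = df.copy()
    idx = out.index
    # Seconds since the Unix epoch (assumes a timezone-naive index).
    epoch = pd.Timestamp("1970-01-01")
    out["timestamp"] = ((idx - epoch) // pd.Timedelta(seconds=1)).to_numpy()
    # Calendar components that expose seasonal, weekly and daily patterns.
    out["month"] = idx.month.to_numpy()
    out["dayofyear"] = idx.dayofyear.to_numpy()
    out["dayofweek"] = idx.dayofweek.to_numpy()
    out["hour"] = idx.hour.to_numpy()
    return out


# Tiny stand-in for final_data; the values are made up for illustration.
demo = pd.DataFrame(
    {"col1": [1.0, 2.0, 3.0]},
    index=pd.date_range("2016-10-02", periods=3, freq="D"),
)
features = add_datetime_features(demo)
print(features)
```

The extra columns can then be selected together with the other inputs when building X, before the call to train_test_split.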