I'm trying to use RandomForestRegressor() in scikit-learn to model some data. After processing my raw data, the data I passed to RandomForestRegressor() is as follows.
The following is only a small part of my data. In total, there are around 6000 rows.
Note that the first column is the DatetimeIndex of the DataFrame 'final_data' that I created, which contains all the data. In addition, the values in column4 were originally strings; I converted them to numbers with a map function.
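For readers unfamiliar with this step, a minimal sketch of a string-to-number mapping is shown here. The category labels and the tiny DataFrame are hypothetical, since the post does not show the real values of column4.

```python
import pandas as pd

# Hypothetical category labels; the real values of column4 are not shown in the post.
category_to_code = {"low": 0, "medium": 1, "high": 2}

df = pd.DataFrame({"column4": ["low", "high", "medium", "low"]})

# Series.map replaces each string with its numeric code.
# Any value missing from the dictionary becomes NaN, so check for that afterwards.
df["column4"] = df["column4"].map(category_to_code)
```

Note that integer codes imply an ordering between categories; for unordered categories, one-hot encoding (for example with pd.get_dummies) may be a better fit.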
import pandas as pd
from datetime import datetime
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
# Split final_data into two seasons (S and W) by date range on the DatetimeIndex.
# Both bounds are inclusive, so rows stamped exactly at a boundary timestamp land in both seasons.
S_dataset1 = final_data[(final_data.index >= pd.to_datetime('20160403')) &
                        (final_data.index <= pd.to_datetime('20161002'))]
S_dataset2 = final_data[(final_data.index >= pd.to_datetime('20170403')) &
                        (final_data.index <= pd.to_datetime('20170901'))]
W_dataset = final_data[(final_data.index >= pd.to_datetime('20161002')) &
                       (final_data.index <= pd.to_datetime('20170403'))]
S_dataset = pd.concat([S_dataset1, S_dataset2])
# Features: the first 8 columns plus 'col20'. Target: 'col9'.
A = W_dataset.iloc[:, :8]
B = W_dataset.loc[:, 'col20']
W_data = pd.concat([A, B], axis=1)
X = W_data.values
y = W_dataset['col9'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1)
# The default criterion is squared error ('mse' in older scikit-learn, 'squared_error' since 1.0),
# so it is left unset here to keep the code working across versions.
forest = RandomForestRegressor(n_estimators=1000,
                               random_state=1, n_jobs=-1)
forest.fit(X_train, y_train)
y_train_pred = forest.predict(X_train)
y_test_pred = forest.predict(X_test)
print('R^2 train: %.3f, test: %.3f' % (r2_score(y_train, y_train_pred),
                                       r2_score(y_test, y_test_pred)))
The code above predicts col9. I split final_data into two seasons, which I hoped would make the prediction more accurate. However, the result is very bad: the R2 score on the training set is around 0.9, but on the test set it is only around 0.25. I really don't know why the result is so bad. Could someone tell me where I went wrong and how I can improve my model? Many thanks!
I think the problem was that I didn't account for the effect of the datetime on the prediction. After converting the DatetimeIndex values to numerical values and feeding them to my model as an input, I got quite a good result: the R2 score is around 0.95-0.98.
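The post does not say exactly how the DatetimeIndex was converted, so the following is only one possible sketch. The helper name add_datetime_features, the new column names, and the small synthetic frame are all assumptions made for illustration; they are not taken from the original code.

```python
import pandas as pd


def add_datetime_features(df):
    """Return a copy of df with numeric columns derived from its DatetimeIndex."""
    out = df.copy()
    idx = out.index
    # Seconds since the Unix epoch (assumes a timezone-naive DatetimeIndex).
    out["timestamp"] = (idx - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)
    # Calendar components, which can expose weekly or yearly patterns.
    out["dayofweek"] = idx.dayofweek  # Monday=0 ... Sunday=6
    out["dayofyear"] = idx.dayofyear
    out["month"] = idx.month
    return out


# Synthetic stand-in for final_data (hypothetical values, not the original data).
demo = pd.DataFrame(
    {"col1": [1.0, 2.0, 3.0]},
    index=pd.date_range("2016-10-02", periods=3, freq="D"),
)
demo_with_time = add_datetime_features(demo)
X = demo_with_time.values  # feature matrix that now includes the time columns
```

With the real data, the same helper could be applied to W_data before calling train_test_split. Since a random split mixes neighbouring timestamps between the training and test sets, it may also be worth checking the score on a chronological hold-out before trusting the improvement.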