I am computing R² with cross_validation.cross_val_score, which gives about 0.35. I then fit the model on the same training set and computed R² with the r2_score function, which gives about 0.87. Why do the two results differ so much? Any help will be appreciated. The code is attached below.
num_folds = 2
num_instances = len(X_train)
scoring ='r2'
models = []
models.append(('RF', RandomForestRegressor()))
results = []
names = []
for name, model in models:
    kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
    cv_results = cross_validation.cross_val_score(model, X_train, Y_train, cv=kfold,
                                                  scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
# Refit on the full training set and score on that same data
model.fit(X_train, Y_train)
train_pred = model.predict(X_train)
r2_score(Y_train, train_pred)
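The metric is the same in both cases; what differs is the data it is computed on. With scoring='r2', cross_val_score splits the training set into two parts (num_folds = 2), fits the model on one part, computes R² on the other, held-out part, and then repeats with the roles swapped. cv_results.mean() is the average of those two scores. r2_score(Y_train, train_pred), by contrast, evaluates the model on the same data it was fitted on. A random forest can fit its own training data very closely, so the training R² (0.87) is optimistic, while the cross-validated R² (0.35) is the better estimate of how the model performs on unseen data. To sum up, you used r2 as a validation score on held-out folds, whereas r2_score measured the fit on the whole training set.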
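To make the difference concrete, here is a minimal sketch on synthetic data. It is not the asker's dataset: the arrays X and y, the noise level, and the random_state values are all assumptions chosen for illustration. It also uses sklearn.model_selection, which replaced the cross_validation module in later scikit-learn releases. The manual loop is meant to show that scoring='r2' is just r2_score applied to each held-out fold.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (assumption): one informative feature plus heavy noise
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] + rng.normal(scale=1.0, size=200)

model = RandomForestRegressor(n_estimators=100, random_state=0)
kfold = KFold(n_splits=2, shuffle=True, random_state=0)

# R² on held-out folds, as computed by cross_val_score
cv_scores = cross_val_score(model, X, y, cv=kfold, scoring="r2")

# The same computation by hand: fit on one fold, score on the other
manual_scores = []
for train_idx, test_idx in kfold.split(X):
    fold_model = clone(model).fit(X[train_idx], y[train_idx])
    manual_scores.append(r2_score(y[test_idx], fold_model.predict(X[test_idx])))

# R² on the very data the model was fitted on
model.fit(X, y)
train_r2 = r2_score(y, model.predict(X))

print("cross-validated R2:", cv_scores.mean())
print("manual per-fold R2:", np.mean(manual_scores))
print("training-set R2:   ", train_r2)
```

On data like this the training-set R² should come out well above the cross-validated one, which is the same pattern as in the question.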