python, scikit-learn, cross-validation, decision-tree

cross_val_score default scoring not consistent?


According to the docs for cross_val_score's scoring parameter:

If None, the estimator’s default scorer (if available) is used.

For a DecisionTreeRegressor, the default criterion is mse. So why am I getting different results here?

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26)

-cross_val_score(dt, X_train, y_train, cv=10, scoring='neg_mean_squared_error')  # leading minus flips the negated MSE back to positive

>>> array([ 46.94808341,  18.78121305,  18.19914701,  18.06935431,
        17.19546733,  28.91247609,  39.41410887,  21.30453162,
        31.96443414,  23.74191199])


cross_val_score(dt, X_train, y_train, cv=10)

>>> array([ 0.35723619,  0.75254466,  0.7181376 ,  0.65718608,  0.72531937,
        0.4752839 ,  0.43169728,  0.63916363,  0.41406146,  0.68977882])

If I had to guess, it seems the default scoring is R^2 rather than MSE. Is my understanding of the default scorer correct, or is this a bug?


Solution

  • The default scorer of a DecisionTreeRegressor is the R^2 score; you can find it in the docs for DecisionTreeRegressor's score method (quoted below; a verification sketch follows the quote).

     score(self, X, y, sample_weight=None)
    
        Return the coefficient of determination R^2 of the prediction.
    
        The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum().
    
        The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
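
You can verify this yourself with a minimal sketch. It assumes a synthetic dataset from make_regression as a stand-in for the asker's X and y (which are not shown in the question): the default scoring matches scoring='r2', negating the 'neg_mean_squared_error' scores recovers positive MSE values, and computing 1 - u/v by hand reproduces what score() returns.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the asker's data
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26, random_state=0)

default_scores = cross_val_score(dt, X, y, cv=10)           # uses dt.score, i.e. R^2
r2_scores = cross_val_score(dt, X, y, cv=10, scoring='r2')  # explicit R^2
mse_scores = -cross_val_score(dt, X, y, cv=10,
                              scoring='neg_mean_squared_error')  # negate to get MSE

print(np.allclose(default_scores, r2_scores))  # True: the default scorer is R^2

# Manual R^2 on one split, following the quoted formula (1 - u/v)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
dt.fit(X_tr, y_tr)
y_pred = dt.predict(X_te)
u = ((y_te - y_pred) ** 2).sum()       # residual sum of squares
v = ((y_te - y_te.mean()) ** 2).sum()  # total sum of squares
print(np.isclose(1 - u / v, dt.score(X_te, y_te)))  # True (up to floating point)

So the two arrays in the question are simply two different metrics computed on the same folds; to compare models by MSE, negate the 'neg_mean_squared_error' scores, as the asker already did.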