python, scikit-learn

What is the default accuracy scoring in cross_val_score() in sklearn?


I have a regression model built with a random forest. I made pipelines with scikit-learn to process the data and then used RandomForestRegressor to predict. I want to measure the accuracy of the model. Because of the risk of over-fitting, I decided to use the cross_val_score function to evaluate it.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

forest_reg = make_pipeline(preprocessing,
                           RandomForestRegressor(random_state=1))
acc = cross_val_score(forest_reg, data, labels, cv=10)

Then I used this to get the accuracy:

print(acc.mean(), acc.std())

It gives me around 0.84 and 0.06.

I understand the standard deviation part, but how is the first number calculated? Is 0.84 good? Is there a better scoring method to measure accuracy?


Solution

  • First, keep in mind that accuracy is typically used for classification tasks, not for regression.

    The documentation says:

    scores: ndarray of float of shape=(len(list(cv)),)

    Array of scores of the estimator for each run of the cross validation.

    You have set the cv parameter to 10. It means that acc is an array of 10 scores.
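
    For instance, here is a minimal sketch on a synthetic dataset (make_regression is a stand-in for your data and labels, which are not shown in the question) illustrating that you get one score per cross-validation fold:

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    # Synthetic stand-in for the original data/labels.
    X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=1)
    reg = RandomForestRegressor(random_state=1)

    acc = cross_val_score(reg, X, y, cv=10)
    print(acc.shape)               # (10,) -- one score per fold
    print(acc.mean(), acc.std())   # summary over the 10 folds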

    But here you don't have the accuracy of each run. Instead, you have the coefficient of determination (R²) of the random forest predictions.

    Again the cross_val_score documentation says:

    scoring: str or callable, default=None

    A str (see model evaluation documentation) or a scorer callable object / function with signature scorer(estimator, X, y) which should return only a single value.

    Similar to cross_validate but only a single metric is permitted.

    If None, the estimator’s default scorer (if available) is used.

    And the default scorer of RandomForestRegressor is R²:

    score(X, y, sample_weight=None)

    Return the coefficient of determination of the prediction.

    The coefficient of determination R² is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a score of 0.0.
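
    To make the quoted definition concrete, here is a small sketch that computes R² by hand and compares it with sklearn.metrics.r2_score (the toy arrays below are invented purely for illustration):

    import numpy as np
    from sklearn.metrics import r2_score

    # Toy values, invented for illustration only.
    y_true = np.array([3.0, -0.5, 2.0, 7.0])
    y_pred = np.array([2.5, 0.0, 2.0, 8.0])

    u = ((y_true - y_pred) ** 2).sum()           # residual sum of squares
    v = ((y_true - y_true.mean()) ** 2).sum()    # total sum of squares
    print(1 - u / v)                             # ~0.9486
    print(r2_score(y_true, y_pred))              # same value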

    Based on your question, you have an R² score with a mean of 0.84 and a standard deviation of 0.06 over 10 runs. I cannot tell you whether that is good; only you can decide if it is acceptable for your use case.
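
    If you want to confirm that the default scorer really is R², one hedged check (again on synthetic data, since the original pipeline and data are not shown) is to run cross_val_score twice, once with the default scoring=None and once with scoring="r2", and verify that the scores match:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=1)
    reg = RandomForestRegressor(random_state=1)

    default_scores = cross_val_score(reg, X, y, cv=10)            # scoring=None
    r2_scores = cross_val_score(reg, X, y, cv=10, scoring="r2")   # explicit R²
    print(np.allclose(default_scores, r2_scores))                 # True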