I am fitting this model from sklearn:

LogisticRegressionCV(
    solver="sag", scoring="neg_log_loss", verbose=0, n_jobs=-1, cv=10
)
The fitting results in a model.score (on the training set) of 0.67 and change. Since there is no way (or I don't know how) to access the results of the cross validation performed as part of the model fitting, I run a separate cross validation on the same model with
cross_val_score(model, X, y, cv=10, scoring="neg_log_loss")
This returns an array of negative numbers
[-0.69517214 -0.69211235 -0.64173978 -0.66429986 -0.77126878 -0.65127196
-0.66302393 -0.65916281 -0.66893633 -0.67605681]
which, if the signs were flipped, would be in a range compatible with the training score.
I've read the discussion in an issue about cross_val_score flipping the sign of the given scoring function; the resolution there was that the neg_* metrics were introduced precisely to make such flipping unnecessary, and I am using neg_log_loss. The issue talks about mse, but the arguments seem to apply to log_loss as well. Is there a way to have cross_val_score return the same metric as specified in its arguments? Is this a bug I should file? Or is it a misunderstanding on my part, and a sign change is still to be expected from cross_val_score?
I hope this is a specific enough question for SO. Sklearn devs redirect users to SO for questions that are not clear-cut bug reports or feature requests.
Adding minimal repro code as requested in the comments (sklearn 0.19.1, Python 2.7):
from numpy.random import randn, seed
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import cross_val_score
seed(0)
X = randn(100, 2)
y = randn(100) > 0
model = LogisticRegressionCV(
solver="sag", scoring="neg_log_loss", verbose=0, n_jobs=-1, cv=10
)
model.fit(X=X, y=y)
model.score(X, y)
cross_val_score(model, X, y, cv=10, scoring="neg_log_loss")
With this code, it no longer looks like a simple sign flip of the metric. The outputs are 0.59 for the score and array([-0.70578452, -0.68773683, -0.68627652, -0.69731349, -0.69198876, -0.70089103, -0.69476663, -0.68279466, -0.70066003, -0.68532253]) for the cross validation score.
Note: edited after the fruitful comment thread with Vivek Kumar and piccolbo.
The score method's strange results

You found a bug, which was fixed in version 0.20.0.
From the changelog:
Fix: Fixed a bug in linear_model.LogisticRegressionCV where the score method always computes accuracy, not the metric given by the scoring parameter. #10998 by Thomas Fan.
Also, sklearn's 0.19 LogisticRegressionCV documentation says:
score(X, y, sample_weight=None)
Returns the mean accuracy on the given test data and labels.
From version 0.20.0 on, the docs reflect the bugfix:
score(X, y, sample_weight=None)
Returns the score using the scoring option on the given test data and labels.
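Until you can upgrade, a workaround on 0.19 (a sketch, not part of the official fix) is to bypass the buggy score method entirely and compute the metric yourself from predict_proba with sklearn.metrics.log_loss:

```python
from numpy.random import randn, seed
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import log_loss

seed(0)
X = randn(100, 2)
y = randn(100) > 0

model = LogisticRegressionCV(
    solver="sag", scoring="neg_log_loss", cv=10
).fit(X, y)

# On 0.19, model.score(X, y) reports accuracy regardless of `scoring`.
# Computing the metric directly gives the training-set neg_log_loss,
# which is on the same (negative) scale as cross_val_score's output.
neg_log_loss_train = -log_loss(y, model.predict_proba(X))
print(neg_log_loss_train)
```

This makes the training score directly comparable to the per-fold values returned by cross_val_score.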
cross_val_score
cross_val_score returns negated values for error or loss metrics, while it preserves the sign for score metrics; with neg_log_loss the negation is built into the scorer itself, so the negative numbers you see are expected. From the documentation:
All scorer objects follow the convention that higher return values are better than lower return values. Thus metrics which measure the distance between the model and the data, like metrics.mean_squared_error, are available as neg_mean_squared_error which return the negated value of the metric.
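You can check this convention directly (a sketch, not from the answer): the neg_log_loss scorer, obtained via sklearn.metrics.get_scorer, returns exactly the negated value of the plain log_loss metric:

```python
from numpy.random import randn, seed
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import get_scorer, log_loss

seed(0)
X = randn(100, 2)
y = randn(100) > 0

model = LogisticRegressionCV(solver="sag", cv=10).fit(X, y)

# The scorer follows the higher-is-better convention (negative output)...
scorer_value = get_scorer("neg_log_loss")(model, X, y)
# ...while the plain metric is a loss (positive, lower is better).
metric_value = log_loss(y, model.predict_proba(X))

# They agree up to sign: scorer_value == -metric_value
print(scorer_value, metric_value)
```

So cross_val_score is behaving as documented: the sign is part of the scorer's definition, not a post-hoc flip you need to undo.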