I'm trying to use precision_score with np.nan as the zero_division value. It fails inside cross_val_score (every fold raises an InvalidParameterError and the score is set to nan; see the traceback below), but it works when I run the same cross-validation splits manually. Here are the data files to reproduce: sklearn_data.pkl.zip
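(If the pickled data is inconvenient, I believe the same failure reproduces with synthetic data along these lines; LogisticRegression and the class names here are just stand-ins for my actual setup:)
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, precision_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_syn, y_syn = make_classification(n_samples=200, random_state=0)
X_syn = pd.DataFrame(X_syn)
y_syn = pd.Series(np.where(y_syn == 1, "Case_0", "Control"), dtype="category")

scorer = make_scorer(precision_score, pos_label="Case_0", zero_division=np.nan)
splits = list(StratifiedKFold(n_splits=5).split(X_syn, y_syn))

# Scores silently become nan (with UserWarnings) whenever n_jobs != 1
cross_val_score(LogisticRegression(max_iter=1000), X_syn, y_syn,
                cv=splits, scoring=scorer, n_jobs=2)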
import pickle
import numpy as np
from sklearn.metrics import precision_score
from sklearn.model_selection import cross_val_score

# Load the pickled objects: estimator, data, scorer, CV splits, n_jobs
with open("sklearn_data.pkl", "rb") as f:
    objects = pickle.load(f)

# > objects.keys()
# dict_keys(['estimator', 'X', 'y', 'scoring', 'cv', 'n_jobs'])
estimator = objects["estimator"]
X = objects["X"]
y = objects["y"]
scoring = objects["scoring"]
cv = objects["cv"]
n_jobs = objects["n_jobs"]
# > scoring
# make_scorer(precision_score, pos_label=Case_0, zero_division=nan)
# > y.unique()
# ['Control', 'Case_0']
# Categories (2, object): ['Case_0', 'Control']
# First I checked that both classes appear in every training and validation split
pos_label = "Case_0"
control_label = "Control"
for index_training, index_validation in cv:
    assert y.iloc[index_training].nunique() == 2
    assert y.iloc[index_validation].nunique() == 2
assert pos_label in y.values
assert control_label in y.values
# If I run the same splits manually, passing the same zero_division=np.nan:
scores = list()
for index_training, index_validation in cv:
    estimator.fit(X.iloc[index_training], y.iloc[index_training])
    y_hat = estimator.predict(X.iloc[index_validation])
    score = precision_score(y_true=y.iloc[index_validation], y_pred=y_hat, pos_label=pos_label, zero_division=np.nan)
    scores.append(score)

# > print(np.mean(scores))
# 0.501156937317928
# If I use cross_val_score:
cross_val_score(estimator=estimator, X=X, y=y, cv=cv, scoring=scoring, n_jobs=n_jobs)
# /Users/jespinoz/anaconda3/envs/soothsayer_py3.9_env2/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:839: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
# Traceback (most recent call last):
# File "/Users/jespinoz/anaconda3/envs/soothsayer_py3.9_env2/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 136, in __call__
# score = scorer._score(
# File "/Users/jespinoz/anaconda3/envs/soothsayer_py3.9_env2/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 355, in _score
# return self._sign * self._score_func(y_true, y_pred, **scoring_kwargs)
# File "/Users/jespinoz/anaconda3/envs/soothsayer_py3.9_env2/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 201, in wrapper
# validate_parameter_constraints(
# File "/Users/jespinoz/anaconda3/envs/soothsayer_py3.9_env2/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 95, in validate_parameter_constraints
# raise InvalidParameterError(
# sklearn.utils._param_validation.InvalidParameterError: The 'zero_division' parameter of precision_score must be a float among {0.0, 1.0, nan} or a str among {'warn'}. Got nan instead.
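The confusing part is that nan is listed in the allowed set {0.0, 1.0, nan}, yet the very same nan is rejected. My best guess at the root cause: with n_jobs != 1, joblib pickles the scorer to ship it to worker processes, and np.nan does not keep its identity through a pickle round trip; if the parameter validation then tests membership in a set of allowed values (my assumption about what the validator does, not something I checked in sklearn's code), a restored nan can never match, because nan compares unequal to everything, including itself. A minimal demonstration of that mechanism:
import pickle
import numpy as np

# A pickled-and-restored nan is a new float object, not numpy's module-level nan
restored = pickle.loads(pickle.dumps(np.nan))
print(restored is np.nan)              # False: identity is lost in the round trip
print(restored == np.nan)              # False: nan is never equal to anything, even itself
print(np.nan in {0.0, 1.0, np.nan})    # True:  membership succeeds via the identity check
print(restored in {0.0, 1.0, np.nan})  # False: neither identity nor equality matches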
Here are my versions (output of sklearn.show_versions()):
System:
    python: 3.9.16 | packaged by conda-forge | (main, Feb 1 2023, 21:42:20) [Clang 14.0.6 ]
    executable: /Users/jespinoz/anaconda3/envs/soothsayer_py3.9_env2/bin/python
    machine: macOS-13.4.1-x86_64-i386-64bit

Python dependencies:
    sklearn: 1.3.1
    pip: 22.0.3
    setuptools: 60.7.1
    numpy: 1.24.4
    scipy: 1.8.0
    Cython: 0.29.27
    pandas: 1.4.0
    matplotlib: 3.7.1
    joblib: 1.3.2
    threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
    user_api: blas
    internal_api: openblas
    prefix: libopenblas
    filepath: /Users/jespinoz/anaconda3/envs/soothsayer_py3.9_env2/lib/libopenblasp-r0.3.18.dylib
    version: 0.3.18
    threading_layer: openmp
    architecture: Haswell
    num_threads: 16

    user_api: openmp
    internal_api: openmp
    prefix: libomp
    filepath: /Users/jespinoz/anaconda3/envs/soothsayer_py3.9_env2/lib/libomp.dylib
    version: None
    num_threads: 16
This is a bug that I reported here: https://github.com/scikit-learn/scikit-learn/issues/27563. One way around it is to use n_jobs=1, which keeps everything in a single process so the scorer (and its np.nan) is never pickled.
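Two sketches of what that looks like (the second is untested; precision_nan_scorer and its body are my own invention, not part of sklearn):
# Workaround 1: single process, the scorer is never pickled
cross_val_score(estimator=estimator, X=X, y=y, cv=cv, scoring=scoring, n_jobs=1)

# Workaround 2 (untested sketch): create the nan inside the scoring callable,
# so the value handed to precision_score is the worker process's own np.nan
def precision_nan_scorer(estimator, X, y):
    y_hat = estimator.predict(X)
    return precision_score(y, y_hat, pos_label="Case_0", zero_division=np.nan)

cross_val_score(estimator=estimator, X=X, y=y, cv=cv, scoring=precision_nan_scorer, n_jobs=n_jobs)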