python-3.x, scikit-learn, scipy, thread-safety, scipy-optimize

Calling `scipy.optimize.minimize` inside an `sklearn` classifier makes it break in a parallel job


I have run into a silent crash that I am attributing to a thread-safety violation.

Here are the details of what happened. First, I defined a custom sklearn estimator that uses scipy.optimize at fitting time, similar to:

from sklearn.base import BaseEstimator, ClassifierMixin

class CustomClassifier(BaseEstimator, ClassifierMixin):

    ...

    def fit(self, X, y=None):
        ...
        # optimises with respect to some metric using scipy.optimize.minimize
        ...
        return self

    ...
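
For concreteness, here is a minimal, hypothetical sketch of the pattern above; the SketchClassifier name, the linear parametrisation, and the logistic loss are my assumptions for illustration, not the original code:

    import numpy as np
    from scipy.optimize import minimize
    from sklearn.base import BaseEstimator, ClassifierMixin

    class SketchClassifier(BaseEstimator, ClassifierMixin):
        """Hypothetical stand-in for CustomClassifier: a linear model whose
        fit() minimises a logistic loss with scipy.optimize.minimize."""

        def fit(self, X, y=None):
            X = np.asarray(X, dtype=float)
            y = np.asarray(y, dtype=float)

            def loss(w):
                # logistic loss of a linear model; labels assumed in {0, 1}
                margins = (X @ w) * (2 * y - 1)
                return np.mean(np.log1p(np.exp(-margins)))

            res = minimize(loss, x0=np.zeros(X.shape[1]), method="L-BFGS-B")
            self.coef_ = res.x
            return self

        def predict(self, X):
            return (np.asarray(X, dtype=float) @ self.coef_ >= 0).astype(int)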

Downstream, I run a cross-validated measurement of its performance, which looks like:

cv_errors = cross_val_score(CustomClassifier(), X, y, n_jobs=-1)

cross_val_score is the out-of-the-box sklearn function; n_jobs=-1 asks for it to be parallelised across as many cores as are available.
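
For reference, a fleshed-out version of that call, assuming CustomClassifier is fully implemented; the make_classification data is synthetic, purely to make the snippet runnable:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score

    # Synthetic data for illustration; any binary classification set works.
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)

    # n_jobs=-1 parallelises the CV folds across all available cores.
    cv_errors = cross_val_score(CustomClassifier(), X, y, n_jobs=-1)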

The output is that cv_errors is an array of NaNs. After some bug chasing, I noticed that setting n_jobs=1 gives me an array populated with the errors, as expected. The parallelisation step, coupled with the use of scipy.optimize.minimize, looks like the culprit.

Is there a way to make it work in parallel?


Solution

  • I think I found a workaround here:

    from joblib import parallel_backend

    with parallel_backend('multiprocessing'):
        cv_errors = cross_val_score(CustomClassifier(), X, y,
                                    n_jobs=-1, error_score='raise')

    This seems to be safe here. If anyone has an explanation of what is happening behind the scenes, and why the default 'loky' backend breaks while 'multiprocessing' does not, I am listening. Also, setting error_score='raise' means that a crash will not be silenced.
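
    As a quick sanity check (my addition, not part of the original fix), one can assert that the returned scores are finite; a NaN would indicate the silent failure mode has come back:

        import numpy as np

        # Every fold should now return a real score; with the
        # 'multiprocessing' backend and error_score='raise', a failing
        # fold raises instead of silently producing NaN.
        assert np.isfinite(cv_errors).all()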