I have run into a silent crash that I am attributing to broken thread-safety. Here are the details of what happened. First, I defined a custom sklearn estimator that uses scipy.optimize at fitting time, similar to:
    class CustomClassifier(BaseEstimator, ClassifierMixin):
        ...
        def fit(self, X, y=None):
            ...
            # optimizes with respect to some metric by using scipy.optimize.minimize
            ...
            return self
        ...
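To make this concrete, here is a minimal hypothetical stand-in for my estimator; the linear model and logistic loss are invented purely for illustration, the only relevant point being that fit goes through scipy.optimize.minimize:

    import numpy as np
    from scipy.optimize import minimize
    from sklearn.base import BaseEstimator, ClassifierMixin

    class CustomClassifier(BaseEstimator, ClassifierMixin):
        # Hypothetical sketch: fits a linear decision boundary by minimising
        # a logistic loss. The loss itself does not matter here; what matters
        # is that fit() calls scipy.optimize.minimize.
        def fit(self, X, y=None):
            X = np.asarray(X, dtype=float)
            y_pm = 2 * np.asarray(y, dtype=float) - 1  # map {0, 1} to {-1, +1}

            def loss(w):
                margins = y_pm * (X @ w[:-1] + w[-1])
                return np.mean(np.logaddexp(0.0, -margins))  # stable log(1 + e^-m)

            result = minimize(loss, x0=np.zeros(X.shape[1] + 1))
            self.coef_, self.intercept_ = result.x[:-1], result.x[-1]
            return self

        def predict(self, X):
            scores = np.asarray(X, dtype=float) @ self.coef_ + self.intercept_
            return (scores > 0).astype(int)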
Downstream, I run a cross-validated measurement of its performance, looking like:

    from sklearn.model_selection import cross_val_score

    cv_errors = cross_val_score(CustomClassifier(), X, y, n_jobs=-1)

cross_val_score is the out-of-the-box sklearn function, and n_jobs=-1 means that I am asking for it to be parallelised over as many cores as are available.
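Together with the sketch above, a self-contained reproduction looks like this (make_classification is only a stand-in for my actual data):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    cv_errors = cross_val_score(CustomClassifier(), X, y, n_jobs=-1)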
The output is that my cv_errors is an array of NaNs. After some bug chasing, I noticed that setting n_jobs=1 gives me an array populated with the errors, as expected. It looks like the parallelisation step, coupled with the use of scipy.optimize.minimize, is the culprit. Is there a way to have it working in parallel?
I think I found a way around it:

    from joblib import parallel_backend

    with parallel_backend('multiprocessing'):
        cv_errors = cross_val_score(CustomClassifier(), X, y, n_jobs=-1, error_score='raise')

This seems to be safe here. If anyone has an explanation of what is happening behind the scenes, and why the default 'loky' backend breaks while 'multiprocessing' does not, I am listening. Also, setting error_score='raise' means that a crash will not be silenced.
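A side note for anyone chasing a similar array of NaNs: keeping the default backend but passing error_score='raise' should surface the actual exception from the failing fit instead of silently returning NaNs (same hypothetical setup as above):

    # Under the default 'loky' backend, this re-raises the exception from the
    # worker instead of filling cv_errors with NaN, making the real failure visible.
    cv_errors = cross_val_score(CustomClassifier(), X, y, n_jobs=-1, error_score='raise')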