python pandas scikit-learn dask joblib

Why doesn't running scikit-learn machine learning with Dask result in parallelism?


I want to run machine learning algorithms from the scikit-learn library on all my cores using the Dask and joblib libraries.

My code for the joblib.parallel_backend with Dask:

#Fire up the Joblib backend with Dask:
with joblib.parallel_backend('dask'):
    model_RFE = RFE(estimator = DecisionTreeClassifier(), n_features_to_select = 5)
    fit_RFE = model_RFE.fit(X_values,Y_values)

Unfortunately, when I look at my task manager I can see all my workers sitting idle, and only one new Python process is doing all the work.

Even in the Dask visualization for my Client, the workers are doing nothing.

  1. Can you please tell me what I am doing wrong?
  2. Is it my code (whole code below)?
  3. I really just want to run ML in parallel for a speed-up. If I don't need to use joblib, I would welcome any other ideas.

My whole code attempt, following the tutorial from the docs:

import pandas as pd

import dask.dataframe as df
from dask.distributed import Client

import sklearn
from sklearn.feature_selection import  RFE
from sklearn.tree import DecisionTreeClassifier

import joblib

#Create cluster on local PC
client = Client(n_workers = 4, threads_per_worker = 1, memory_limit = '4GB')
client

#Read data from .csv
dataframe_lazy = df.read_csv(path, engine = 'c', low_memory = False)
dataframe = dataframe_lazy.compute()

#Get my X and Y values and release the original DF from memory
X_values = dataframe.drop(columns = ['Id', 'Target'])
Y_values = dataframe['Target']

del dataframe 

#Prepare data
X_values.fillna(0, inplace = True)

#Fire up the Joblib backend with Dask:
with joblib.parallel_backend('dask'):
    model_RFE = RFE(estimator = DecisionTreeClassifier(), n_features_to_select = 5)
    fit_RFE = model_RFE.fit(X_values,Y_values)

Solution

  • The Dask joblib backend cannot parallelize every scikit-learn model, only some of them, as indicated in the Parallelism docs. Many scikit-learn models support only sequential training, either because of the algorithm's implementation or because parallel support has not been added.

    Dask can only parallelize models that have an n_jobs parameter, which indicates that the scikit-learn model is written to support parallel training. RFE and DecisionTreeClassifier do not have an n_jobs parameter. I wrote this gist that you can run to get a full list of the models that support parallel training.