dask, azure-machine-learning-service

Use a Dask Cluster in a PythonScriptStep


Is it possible to have a multi-node Dask cluster be the compute for a PythonScriptStep with AML Pipelines?

We have a PythonScriptStep that uses featuretools' deep feature synthesis (dfs) (docs). ft.dfs() has an n_jobs parameter that allows for parallelization. When we run on a single machine, the job takes three hours; it runs much faster on a Dask cluster. How can I operationalize this within an Azure ML pipeline?


Solution

  • We've been working on and recently released dask_cloudprovider.AzureMLCluster, which might be of interest to you: link to repo. You can install it via pip install dask-cloudprovider.

    AzureMLCluster instantiates a Dask cluster on the Azure ML service, with the elasticity to scale up to hundreds of nodes should you require it. The only required parameter is the Workspace object, but you can pass in your own ComputeTarget if you choose to.

    You can find an example of how to use it here. In that example I use a custom GPU/RAPIDS Docker image, but you can use any image via the Environment class.
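    A minimal sketch of how the pieces could fit together inside the step's script, assuming the dask-cloudprovider and featuretools APIs of this era (AzureMLCluster taking a Workspace, and ft.dfs accepting a dask_kwargs dict pointing at an existing scheduler); the EntitySet `es` and the target entity name "customers" are placeholders you'd replace with your own:

    ```python
    # Sketch only: assumes azureml-sdk, dask-cloudprovider, and featuretools
    # are installed, and a config.json for your workspace is present.
    from azureml.core import Workspace
    from dask.distributed import Client
    from dask_cloudprovider import AzureMLCluster
    import featuretools as ft

    ws = Workspace.from_config()  # load the Azure ML Workspace

    # Spin up a Dask cluster on Azure ML compute; pass your own
    # ComputeTarget via compute_target=... if you already have one.
    cluster = AzureMLCluster(ws)
    client = Client(cluster)

    # Point dfs at the remote scheduler instead of a local worker pool;
    # `es` is your existing EntitySet (placeholder here).
    feature_matrix, feature_defs = ft.dfs(
        entityset=es,
        target_entity="customers",  # hypothetical target entity
        dask_kwargs={"cluster": cluster.scheduler_address},
    )
    ```

    Because the cluster is created from the script itself, this runs unchanged inside a PythonScriptStep: the step's compute target only needs to host the client, while the heavy dfs work fans out to the Dask workers.
    
    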