Tags: python, azure, scikit-learn, cluster-computing, azure-machine-learning-service

How to parallelize work on an Azure ML Service Compute cluster?


I am able to submit jobs to Azure ML Services using a compute cluster. It works well, and the autoscaling combined with the flexibility for custom environments seems to be exactly what I need. However, so far all of these jobs seem to use only one compute node of the cluster. Ideally I would like to use multiple nodes for a computation, but all the methods I have seen rely on rather deep integration with Azure ML Services.

My modelling case is a bit atypical. From previous experiments I identified a group of architectures (pipelines of preprocessing steps + estimators in scikit-learn) that work well. Hyperparameter tuning for one of these estimators can be performed reasonably fast (a couple of minutes) with RandomizedSearchCV, so parallelizing that step seems less worthwhile.
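For concreteness, tuning a single architecture looks roughly like the sketch below (the pipeline steps and search space are just placeholders for my real architectures):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV

# one "architecture": a preprocessing + estimator pipeline and its search space
model1 = (Pipeline([('scale', StandardScaler()), ('clf', SVC())]),
          {'clf__C': [0.1, 1, 10], 'clf__kernel': ['linear', 'rbf']})

def tune_model(architecture, X, y):
    pipeline, param_distributions = architecture
    search = RandomizedSearchCV(pipeline, param_distributions, n_iter=20, cv=5)
    search.fit(X, y)  # fast enough on a single node (a couple of minutes)
    return search.best_estimator_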

Now I want to tune and train this entire list of architectures. This should be very easy to parallelize, since all architectures can be trained independently.

Ideally I would like something like this (in pseudocode):

tuned = AzurePool.map(tune_model, [model1, model2,...])

However, I could not find any resources on how to achieve this with an Azure ML Compute cluster. An acceptable alternative would be a plug-and-play substitute for sklearn's CV-tuning methods, similar to the ones provided in Dask or Spark.


Solution

  • There are a number of ways you could tackle this with AzureML. The simplest is to launch a number of jobs yourself using the AzureML Python SDK (the underlying example is taken from here):

    from azureml.train.sklearn import SKLearn
    
    runs = []
    
    # each (kernel, penalty) combination becomes its own run on the cluster
    for kernel in ['linear', 'rbf', 'poly', 'sigmoid']:
        for penalty in [0.5, 1, 1.5]:
            print('submitting run for kernel', kernel, 'penalty', penalty)
            script_params = {
                '--kernel': kernel,
                '--penalty': penalty,
            }
    
            estimator = SKLearn(source_directory=project_folder, 
                                script_params=script_params,
                                compute_target=compute_target,
                                entry_script='train_iris.py',
                                pip_packages=['joblib==0.13.2'])
    
            runs.append(experiment.submit(estimator))
    

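    Each submitted run is scheduled independently, so with autoscaling enabled they will spread across the cluster's nodes up to its maximum node count. If you want to block until they have all finished and collect the results, something like this should work:

    for run in runs:
        run.wait_for_completion(show_output=False)

    results = [run.get_metrics() for run in runs]
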
    The above requires you to factor your training out into a script (or a set of scripts in a folder) along with the Python packages it needs. The above estimator is a convenience wrapper for using scikit-learn. There are also estimators for TensorFlow, PyTorch, Chainer and a generic one (azureml.train.estimator.Estimator) -- they differ only in the Python packages and base Docker image they use.
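
    For reference, the entry script (train_iris.py above) would, in outline, parse the script parameters and log the metric you care about back to the run. A minimal sketch (the dataset and metric below are placeholders):

    import argparse
    from azureml.core.run import Run
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    parser = argparse.ArgumentParser()
    parser.add_argument('--kernel', type=str, default='rbf')
    parser.add_argument('--penalty', type=float, default=1.0)
    args = parser.parse_args()

    run = Run.get_context()                      # handle to the current AzureML run
    X, y = load_iris(return_X_y=True)
    model = SVC(kernel=args.kernel, C=args.penalty)
    accuracy = cross_val_score(model, X, y, cv=5).mean()
    run.log('Accuracy', accuracy)                # logged metrics show up in the run details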

    A second option, if you are actually tuning parameters, is to use the HyperDrive service like so (using the same SKLearn Estimator as above):

    from azureml.train.sklearn import SKLearn
    from azureml.train.hyperdrive.runconfig import HyperDriveConfig
    from azureml.train.hyperdrive.sampling import RandomParameterSampling
    from azureml.train.hyperdrive.run import PrimaryMetricGoal
    from azureml.train.hyperdrive.parameter_expressions import choice
    
    estimator = SKLearn(source_directory=project_folder, 
                        script_params=script_params,
                        compute_target=compute_target,
                        entry_script='train_iris.py',
                        pip_packages=['joblib==0.13.2'])
    
    param_sampling = RandomParameterSampling( {
        "--kernel": choice('linear', 'rbf', 'poly', 'sigmoid'),
        "--penalty": choice(0.5, 1, 1.5)
        }
    )
    
    hyperdrive_run_config = HyperDriveConfig(estimator=estimator,
                                             hyperparameter_sampling=param_sampling,
                                             primary_metric_name='Accuracy',
                                             primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                             max_total_runs=12,       # total configurations to try
                                             max_concurrent_runs=4)   # runs executing at the same time
    
    hyperdrive_run = experiment.submit(hyperdrive_run_config)
    

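    HyperDrive then handles the scheduling and concurrency for you (here at most 4 runs execute at a time). Note that primary_metric_name='Accuracy' assumes the entry script logs a metric with exactly that name, as in the sketch above. Once the parent run finishes you can pull out the best configuration, e.g.:

    best_run = hyperdrive_run.get_best_run_by_primary_metric()
    print(best_run.id, best_run.get_metrics())
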
    Or you could use Dask to schedule the work, as you mentioned. Here is a sample of how to set up Dask on an AzureML Compute cluster so you can do interactive work on it: https://github.com/danielsc/azureml-and-dask
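
    Once a Dask scheduler is running on the cluster (as in the linked repo), the plug-and-play substitute you asked about essentially falls out of scikit-learn's joblib integration: point joblib at the Dask cluster and your existing RandomizedSearchCV calls fan out over the workers. A rough sketch, where scheduler_address is a placeholder for whatever address the linked setup exposes:

    import joblib
    from dask.distributed import Client
    from sklearn.datasets import load_iris
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.svm import SVC

    client = Client(scheduler_address)   # placeholder: address of the Dask scheduler on the cluster

    X, y = load_iris(return_X_y=True)
    search = RandomizedSearchCV(SVC(), {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}, n_iter=5)

    with joblib.parallel_backend('dask'):
        search.fit(X, y)                 # candidate fits are dispatched to the Dask workers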