Tags: python, azure, scikit-learn, cluster-computing, azure-machine-learning-service

How to parallelize work on an Azure ML Service Compute cluster?


I am able to submit jobs to Azure ML Services using a compute cluster. It works well, and the autoscaling combined with the flexibility for custom environments seems to be exactly what I need. However, so far all of these jobs seem to use only one compute node of the cluster. Ideally I would like to use multiple nodes for a computation, but all the methods I have seen rely on rather deep integration with Azure ML Services.

My modelling case is a bit atypical. From previous experiments I identified a group of architectures (pipelines of preprocessing steps + estimators in scikit-learn) that work well. Hyperparameter tuning for one of these estimators can be performed reasonably fast (a couple of minutes) with RandomizedSearchCV, so parallelizing that step seems less worthwhile.
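For concreteness, tuning a single architecture looks roughly like the sketch below (the pipeline steps and search space are just placeholders for my real architectures):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV

# one "architecture": a preprocessing + estimator pipeline and its search space
model1 = (Pipeline([('scale', StandardScaler()), ('clf', SVC())]),
          {'clf__C': [0.1, 1, 10], 'clf__kernel': ['linear', 'rbf']})

def tune_model(architecture, X, y):
    pipeline, param_distributions = architecture
    search = RandomizedSearchCV(pipeline, param_distributions, n_iter=20, cv=5)
    search.fit(X, y)  # fast enough on a single node (a couple of minutes)
    return search.best_estimator_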

Now I want to tune and train this entire list of architectures. This should be very easy to parallelize, since all architectures can be trained independently.

Ideally I would like something like this (in pseudocode):

tuned = AzurePool.map(tune_model, [model1, model2,...])

However, I could not find any resources on how to achieve this with an Azure ML Compute cluster. An acceptable alternative would be a plug-and-play substitute for sklearn's CV-tuning methods, similar to the ones provided in Dask or Spark.


Solution

  • There are a number of ways you could tackle this with AzureML. The simplest is to launch a number of jobs yourself using the AzureML Python SDK (the underlying example is taken from here):

    from azureml.train.sklearn import SKLearn
    
    runs = []
    
    # each (kernel, penalty) combination becomes its own run on the cluster
    for kernel in ['linear', 'rbf', 'poly', 'sigmoid']:
        for penalty in [0.5, 1, 1.5]:
            print('submitting run for kernel', kernel, 'penalty', penalty)
            script_params = {
                '--kernel': kernel,
                '--penalty': penalty,
            }
    
            estimator = SKLearn(source_directory=project_folder, 
                                script_params=script_params,
                                compute_target=compute_target,
                                entry_script='train_iris.py',
                                pip_packages=['joblib==0.13.2'])
    
            runs.append(experiment.submit(estimator))
    

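    Each submitted run is scheduled independently, so with autoscaling enabled they will spread across the cluster's nodes up to its maximum node count. If you want to block until they have all finished and collect the results, something like this should work:

    for run in runs:
        run.wait_for_completion(show_output=False)

    results = [run.get_metrics() for run in runs]
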
    The above requires you to factor your training out into a script (or a set of scripts in a folder) along with the Python packages it needs. The above estimator is a convenience wrapper for using scikit-learn. There are also estimators for TensorFlow, PyTorch, Chainer and a generic one (azureml.train.estimator.Estimator) -- they differ only in the Python packages and base Docker image they use.
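
    For reference, the entry script (train_iris.py above) would, in outline, parse the script parameters and log the metric you care about back to the run. A minimal sketch (the dataset and metric below are placeholders):

    import argparse
    from azureml.core.run import Run
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    parser = argparse.ArgumentParser()
    parser.add_argument('--kernel', type=str, default='rbf')
    parser.add_argument('--penalty', type=float, default=1.0)
    args = parser.parse_args()

    run = Run.get_context()                      # handle to the current AzureML run
    X, y = load_iris(return_X_y=True)
    model = SVC(kernel=args.kernel, C=args.penalty)
    accuracy = cross_val_score(model, X, y, cv=5).mean()
    run.log('Accuracy', accuracy)                # logged metrics show up in the run details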

    A second option, if you are actually tuning parameters, is to use the HyperDrive service like so (using the same SKLearn Estimator as above):

    from azureml.train.sklearn import SKLearn
    from azureml.train.hyperdrive.runconfig import HyperDriveConfig
    from azureml.train.hyperdrive.sampling import RandomParameterSampling
    from azureml.train.hyperdrive.run import PrimaryMetricGoal
    from azureml.train.hyperdrive.parameter_expressions import choice
    
    estimator = SKLearn(source_directory=project_folder, 
                        script_params=script_params,
                        compute_target=compute_target,
                        entry_script='train_iris.py',
                        pip_packages=['joblib==0.13.2'])
    
    param_sampling = RandomParameterSampling( {
        "--kernel": choice('linear', 'rbf', 'poly', 'sigmoid'),
        "--penalty": choice(0.5, 1, 1.5)
        }
    )
    
    hyperdrive_run_config = HyperDriveConfig(estimator=estimator,
                                             hyperparameter_sampling=param_sampling,
                                             primary_metric_name='Accuracy',
                                             primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                             max_total_runs=12,       # total configurations to try
                                             max_concurrent_runs=4)   # runs executing at the same time
    
    hyperdrive_run = experiment.submit(hyperdrive_run_config)
    

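    HyperDrive then handles the scheduling and concurrency for you (here at most 4 runs execute at a time). Note that primary_metric_name='Accuracy' assumes the entry script logs a metric with exactly that name, as in the sketch above. Once the parent run finishes you can pull out the best configuration, e.g.:

    best_run = hyperdrive_run.get_best_run_by_primary_metric()
    print(best_run.id, best_run.get_metrics())
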
    Or you could use Dask to schedule the work, as you mentioned. Here is a sample of how to set up Dask on an AzureML Compute cluster so you can do interactive work on it: https://github.com/danielsc/azureml-and-dask
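
    Once a Dask scheduler is running on the cluster (as in the linked repo), the plug-and-play substitute you asked about essentially falls out of scikit-learn's joblib integration: point joblib at the Dask cluster and your existing RandomizedSearchCV calls fan out over the workers. A rough sketch, where scheduler_address is a placeholder for whatever address the linked setup exposes:

    import joblib
    from dask.distributed import Client
    from sklearn.datasets import load_iris
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.svm import SVC

    client = Client(scheduler_address)   # placeholder: address of the Dask scheduler on the cluster

    X, y = load_iris(return_X_y=True)
    search = RandomizedSearchCV(SVC(), {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}, n_iter=5)

    with joblib.parallel_backend('dask'):
        search.fit(X, y)                 # candidate fits are dispatched to the Dask workers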