I am able to submit jobs to Azure ML services using a compute cluster. It works well, and the autoscaling combined with the flexibility for custom environments seems to be exactly what I need. However, so far all of these jobs only use a single compute node of the cluster. Ideally I would like to use multiple nodes for a computation, but all the methods I have seen rely on rather deep integration with Azure ML services.
My modelling case is a bit atypical. From previous experiments I identified a group of architectures (pipelines of preprocessing steps + estimators in Scikit-learn) that worked well. Hyperparameter tuning for one of these estimators can be performed reasonably fast (a couple of minutes) with RandomizedSearchCV, so parallelizing that step seems less worthwhile.
Now I want to tune and train this entire list of architectures. This should be very easy to parallelize, since all architectures can be trained independently.
Ideally I would like something like the following (in pseudocode):
tuned = AzurePool.map(tune_model, [model1, model2,...])
However, I could not find any resources on how I could achieve this with an Azure ML Compute cluster. An acceptable alternative would be a plug-and-play substitute for sklearn's CV-tuning methods, similar to the ones provided in Dask or Spark.
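For context, the tune_model step from the pseudocode above is roughly the following; the pipeline, parameter distributions, and data arguments are placeholders, not actual code I am running:

from sklearn.model_selection import RandomizedSearchCV

def tune_model(pipeline, param_distributions, X, y):
    # Tune one architecture; this is the couple-of-minutes step that
    # does not need to be parallelized itself.
    search = RandomizedSearchCV(pipeline, param_distributions,
                                n_iter=20, cv=5, n_jobs=-1)
    search.fit(X, y)
    return search.best_estimator_, search.best_score_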
There are a number of ways you could tackle this with AzureML. The simplest would be to just launch a number of jobs using the AzureML Python SDK (the underlying example is taken from here):
from azureml.train.sklearn import SKLearn

runs = []

for kernel in ['linear', 'rbf', 'poly', 'sigmoid']:
    for penalty in [0.5, 1, 1.5]:
        print('submitting run for kernel', kernel, 'penalty', penalty)
        script_params = {
            '--kernel': kernel,
            '--penalty': penalty,
        }
        estimator = SKLearn(source_directory=project_folder,
                            script_params=script_params,
                            compute_target=compute_target,
                            entry_script='train_iris.py',
                            pip_packages=['joblib==0.13.2'])
        runs.append(experiment.submit(estimator))
The above requires you to factor your training out into a script (or a set of scripts in a folder) along with the Python packages required. The above estimator is a convenience wrapper for using Scikit-learn. There are also estimators for TensorFlow, PyTorch, Chainer and a generic one (azureml.train.estimator.Estimator) -- they all differ in the Python packages and base Docker image they use.
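The sample's train_iris.py is not reproduced here, but a minimal entry script along those lines just parses the parameters from the command line and logs whatever metrics you want to compare later. A rough sketch (names and details are illustrative, not the exact script from the sample):

# train_iris.py -- minimal sketch of an entry script
import argparse
import os
import joblib
from azureml.core import Run
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

parser = argparse.ArgumentParser()
parser.add_argument('--kernel', type=str, default='rbf')
parser.add_argument('--penalty', type=float, default=1.0)
args = parser.parse_args()

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel=args.kernel, C=args.penalty)
accuracy = cross_val_score(clf, X, y, cv=5).mean()

# Log to the run so you can compare runs in the portal or via the SDK
# (and so HyperDrive can use it as the primary metric, see below).
run = Run.get_context()
run.log('Accuracy', accuracy)

# Anything written to ./outputs is uploaded with the run.
os.makedirs('outputs', exist_ok=True)
clf.fit(X, y)
joblib.dump(clf, 'outputs/model.joblib')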
A second option, if you are actually tuning parameters, is to use the HyperDrive service like so (using the same SKLearn Estimator as above):
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.parameter_expressions import choice

estimator = SKLearn(source_directory=project_folder,
                    script_params=script_params,
                    compute_target=compute_target,
                    entry_script='train_iris.py',
                    pip_packages=['joblib==0.13.2'])

param_sampling = RandomParameterSampling({
    "--kernel": choice('linear', 'rbf', 'poly', 'sigmoid'),
    "--penalty": choice(0.5, 1, 1.5)
})

hyperdrive_run_config = HyperDriveConfig(estimator=estimator,
                                         hyperparameter_sampling=param_sampling,
                                         primary_metric_name='Accuracy',
                                         primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                         max_total_runs=12,
                                         max_concurrent_runs=4)

hyperdrive_run = experiment.submit(hyperdrive_run_config)
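Once submitted, you can block on the sweep and pick out the best child run by the primary metric declared above, roughly like this:

# Wait for all child runs to finish, then retrieve the best one.
hyperdrive_run.wait_for_completion(show_output=True)
best_run = hyperdrive_run.get_best_run_by_primary_metric()
print(best_run.get_metrics())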
Or you could use Dask to schedule the work, as you mentioned. Here is a sample of how to set up Dask on an AzureML Compute Cluster so you can do interactive work on it: https://github.com/danielsc/azureml-and-dask
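With a setup like that, the cluster exposes a Dask scheduler you can connect a Client to, which gets you essentially the AzurePool.map pattern you described. A rough sketch, where the scheduler address depends on how the linked setup is deployed and tune_model / the model list are your own objects:

from dask.distributed import Client

# Address of the Dask scheduler running on the AzureML compute cluster
# (how you obtain it depends on the setup in the linked repo).
client = Client('tcp://<scheduler-ip>:8786')

# Fan the independent tuning jobs out over the cluster, one per architecture.
futures = client.map(tune_model, [model1, model2, model3])
tuned = client.gather(futures)

Using joblib's Dask backend (joblib.parallel_backend('dask')) should also let sklearn's own RandomizedSearchCV fan its candidates out over the same cluster, if you ever do want to parallelize the inner tuning step.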