I am submitting the training through a script file. Following is the content of the train.py
script. Azure ML is treating all these as one run (instead of run per alpha value as coded below) as Run.get_context()
is returning the same Run id.
train.py
from azureml.opendatasets import Diabetes
from azureml.core import Run
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.externals import joblib
import math
import os
import logging
# Load dataset
dataset = Diabetes.get_tabular_dataset()
print(dataset.take(1))
df = dataset.to_pandas_dataframe()
df.describe()
# Split X (independent variables) & Y (target variable)
x_df = df.dropna() # Remove rows that have missing values
y_df = x_df.pop("Y") # Y is the label/target variable
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, test_size=0.2, random_state=66)
print('Original dataset size:', df.size)
print("Size after dropping 'na':", x_df.size)
print("Training split size: ", x_train.size)
print("Test split size: ", x_test.size)
# Training
alphas = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] # Define hyperparameters
# Create and log interactive runs
output_dir = os.path.join(os.getcwd(), 'outputs')
for hyperparam_alpha in alphas:
# Get the experiment run context
run = Run.get_context()
print("Started run: ", run.id)
run.log("train_split_size", x_train.size)
run.log("test_split_size", x_train.size)
run.log("alpha_value", hyperparam_alpha)
# Train
print("Train ...")
model = Ridge(hyperparam_alpha)
model.fit(X = x_train, y = y_train)
# Predict
print("Predict ...")
y_pred = model.predict(X = x_test)
# Calculate & log error
rmse = math.sqrt(mean_squared_error(y_true = y_test, y_pred = y_pred))
run.log("rmse", rmse)
print("rmse", rmse)
# Serialize the model to local directory
if not os.path.isdir(output_dir):
os.makedirs(output_dir, exist_ok=True)
print("Save model ...")
model_name = "model_alpha_" + str(hyperparam_alpha) + ".pkl" # Pickle file
file_path = os.path.join(output_dir, model_name)
joblib.dump(value = model, filename = file_path)
# Upload the model
run.upload_file(name = model_name, path_or_stream = file_path)
# Complete the run
run.complete()
Authoring code (i.e. control plane)
import os
from azureml.core import Workspace, Experiment, RunConfiguration, ScriptRunConfig, VERSION, Run
ws = Workspace.from_config()
exp = Experiment(workspace = ws, name = "diabetes-local-script-file")
# Create new run config obj
run_local_config = RunConfiguration()
# This means that when we run locally, all dependencies are already provided.
run_local_config.environment.python.user_managed_dependencies = True
# Create new script config
script_run_cfg = ScriptRunConfig(
source_directory = os.path.join(os.getcwd(), 'code'),
script = 'train.py',
run_config = run_local_config)
run = exp.submit(script_run_cfg)
run.wait_for_completion(show_output=True)
run = Run.get_context()
assigns the run object of the run that you're currently in to run
. So in every iteration of the hyperparameter search, you're logging to the same run. To solve this, you need to create child (or sub-) runs for each hyperparameter value. You can do this with run.child_run()
. Below is the template for making this happen.
run = Run.get_context()
for hyperparam_alpha in alphas:
# Get the experiment run context
run_child = run.child_run()
print("Started run: ", run_child.id)
run_child.log("train_split_size", x_train.size)
On the diabetes-local-script-file
Experiment page, you can see that Run 9
was the parent run and Runs 10-19
were the child runs if you click "Include child runs" page. There is also a "Child runs" tab on Run 9 details page.
I highly recommend abstracting the hyperparameter search away from the data plane (i.e. train.py
) and into the control plane (i.e. "authoring code"). This becomes especially valuable as training time increases and you can arbitrarily parallelize and also choose Hyperparameters more intelligently by using Azure ML's Hyperdrive
.
Remove the loop from your code, add the code like below (full data and control here)
import argparse
from pprint import pprint
parser = argparse.ArgumentParser()
parser.add_argument('--alpha', type=float, default=0.5)
args = parser.parse_args()
print("all args:")
pprint(vars(args))
# use the variable like this
model = Ridge(args.alpha)
below is how to submit a single run using a script argument. To submit multiple runs, just use a loop in the control plane.
alphas = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] # Define hyperparameters
list_rcs = [ScriptRunConfig(
source_directory = os.path.join(os.getcwd(), 'code'),
script = 'train.py',
arguments=['--alpha',a],
run_config = run_local_config) for a in alphas]
list_runs = [exp.submit(rc) for rc in list_rcs]
In this way you outsource the hyperparameter source to Hyperdrive
. The UI will also report results exactly how you want them, and via the API you can easily download the best model. Note you can't use this locally anymore and must use AMLCompute
, but to me it is a worthwhile trade-off.This is a great overview. Excerpt below (full code here)
param_sampling = GridParameterSampling( {
"alpha": choice(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)
}
)
estimator = Estimator(
source_directory = os.path.join(os.getcwd(), 'code'),
entry_script = 'train.py',
compute_target=cpu_cluster,
environment_definition=Environment.get(workspace=ws, name="AzureML-Tutorial")
)
hyperdrive_run_config = HyperDriveConfig(estimator=estimator,
hyperparameter_sampling=param_sampling,
policy=None,
primary_metric_name="rmse",
primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
max_total_runs=10,
max_concurrent_runs=4)
run = exp.submit(hyperdrive_run_config)
run.wait_for_completion(show_output=True)