Tags: azure, hyperparameters, azure-machine-learning-service

Run.get_context() gives the same run id


I am submitting the training through a script file. Below is the content of the train.py script. Azure ML treats all of these as one run (instead of one run per alpha value, as coded below) because Run.get_context() returns the same run id each time.

train.py

from azureml.opendatasets import Diabetes
from azureml.core import Run

from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.externals import joblib

import math
import os
import logging

# Load dataset
dataset = Diabetes.get_tabular_dataset()
print(dataset.take(1))

df = dataset.to_pandas_dataframe()
df.describe()

# Split X (independent variables) & Y (target variable)
x_df = df.dropna()      # Remove rows that have missing values
y_df = x_df.pop("Y")    # Y is the label/target variable

x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, test_size=0.2, random_state=66)
print('Original dataset size:', df.size)
print("Size after dropping 'na':", x_df.size)
print("Training split size: ", x_train.size)
print("Test split size: ", x_test.size)

# Training
alphas = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] # Define hyperparameters

# Create and log interactive runs

output_dir = os.path.join(os.getcwd(), 'outputs')

for hyperparam_alpha in alphas:
    # Get the experiment run context
    run = Run.get_context()
    print("Started run: ", run.id)
    run.log("train_split_size", x_train.size)
    run.log("test_split_size", x_train.size)
    run.log("alpha_value", hyperparam_alpha)

    # Train
    print("Train ...")
    model = Ridge(hyperparam_alpha)
    model.fit(X = x_train, y = y_train)
    
    # Predict
    print("Predict ...")
    y_pred = model.predict(X = x_test)

    # Calculate & log error
    rmse = math.sqrt(mean_squared_error(y_true = y_test, y_pred = y_pred))
    run.log("rmse", rmse)
    print("rmse", rmse)

    # Serialize the model to local directory
    if not os.path.isdir(output_dir):
        os.makedirs(output_dir, exist_ok=True) 

    print("Save model ...")
    model_name = "model_alpha_" + str(hyperparam_alpha) + ".pkl" # Pickle file
    file_path = os.path.join(output_dir, model_name)
    joblib.dump(value = model, filename = file_path)

    # Upload the model
    run.upload_file(name = model_name, path_or_stream = file_path)

    # Complete the run
    run.complete()

Experiments view (screenshot)

Authoring code (i.e. control plane)

import os
from azureml.core import Workspace, Experiment, RunConfiguration, ScriptRunConfig, VERSION, Run

ws = Workspace.from_config()
exp = Experiment(workspace = ws, name = "diabetes-local-script-file")

# Create new run config obj
run_local_config = RunConfiguration()

# This means that when we run locally, all dependencies are already provided.
run_local_config.environment.python.user_managed_dependencies = True

# Create new script config
script_run_cfg = ScriptRunConfig(
    source_directory =  os.path.join(os.getcwd(), 'code'),
    script = 'train.py',
    run_config = run_local_config) 

run = exp.submit(script_run_cfg)
run.wait_for_completion(show_output=True)
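
To see the symptom concretely, you can inspect the logged metrics once the run finishes. Because every loop iteration logs to the same run, all ten values for each metric name end up under this single run. A small illustrative sketch (using the run object returned by exp.submit above):

# All ten alpha and rmse values end up under this one run
metrics = run.get_metrics()
print(metrics["alpha_value"])   # expected: the list of alphas, not one value per run
print(metrics["rmse"])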

Solution

  • Short Answer

    Option 1: create child runs within the run

    run = Run.get_context() assigns the Run object of the run you're currently in to run. So in every iteration of the hyperparameter search, you're logging to the same run. To solve this, create a child (or sub-) run for each hyperparameter value. You can do this with run.child_run(). Below is a template for making this happen.

    run = Run.get_context()   # parent run for this script submission
    
    for hyperparam_alpha in alphas:
        # Create a child run for this hyperparameter value
        run_child = run.child_run()
        print("Started child run: ", run_child.id)
        run_child.log("train_split_size", x_train.size)
        # ... train, predict, log and upload as before, then:
        run_child.complete()
    

    On the diabetes-local-script-file Experiment page, you can see that Run 9 was the parent run and Runs 10-19 were the child runs if you toggle "Include child runs". There is also a "Child runs" tab on the Run 9 details page.
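
    If you prefer to inspect the children programmatically rather than in the studio UI, a minimal sketch (assuming the parent run object from the template above) could look like this:

    # List each child run and the metrics it logged
    for child in run.get_children():
        print(child.id, child.get_metrics())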


    Long answer

    I highly recommend abstracting the hyperparameter search away from the data plane (i.e. train.py) and into the control plane (i.e. the "authoring code"). This becomes especially valuable as training time increases, because you can parallelize arbitrarily and also choose hyperparameters more intelligently by using Azure ML's Hyperdrive.

    Option 2: Create runs from the control plane

    Remove the loop from train.py and parse the alpha value as a script argument instead, as shown below.

    import argparse
    from pprint import pprint
    
    parser = argparse.ArgumentParser()
    parser.add_argument('--alpha', type=float, default=0.5)
    args = parser.parse_args()
    print("all args:")
    pprint(vars(args))
    
    # use the variable like this
    model = Ridge(args.alpha)
    

    Below is how to submit the runs from the control plane: each alpha value is passed as a script argument and each submission becomes its own run in the experiment.

    alphas = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] # Define hyperparameters
    
    list_rcs = [ScriptRunConfig(
        source_directory =  os.path.join(os.getcwd(), 'code'),
        script = 'train.py',
        arguments=['--alpha',a],
        run_config = run_local_config) for a in alphas]
    
    list_runs = [exp.submit(rc) for rc in list_rcs]
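
    Once the runs are submitted you can also compare them from the control plane. A minimal sketch (assuming the list_runs list above and that train.py still logs an "rmse" metric per run):

    # Wait for every run to finish, then pick the one with the lowest rmse
    for r in list_runs:
        r.wait_for_completion(show_output=False)

    best = min(list_runs, key=lambda r: r.get_metrics()["rmse"])
    print("Best run:", best.id, "metrics:", best.get_metrics())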
    
    

    Option 3: Hyperdrive (IMHO the recommended approach)

    In this way you outsource the hyperparameter search to Hyperdrive. The UI will also report results exactly how you want them, and via the API you can easily download the best model. Note that you can't run this locally anymore and must use AMLCompute, but to me it is a worthwhile trade-off. An excerpt is below.

    from azureml.core import Environment
    from azureml.train.estimator import Estimator
    from azureml.train.hyperdrive import GridParameterSampling, HyperDriveConfig, PrimaryMetricGoal, choice
    
    param_sampling = GridParameterSampling( {
            "alpha": choice(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)
        }
    )
    
    estimator = Estimator(
        source_directory =  os.path.join(os.getcwd(), 'code'),
        entry_script = 'train.py',
        compute_target=cpu_cluster,
        environment_definition=Environment.get(workspace=ws, name="AzureML-Tutorial")
    )
    
    hyperdrive_run_config = HyperDriveConfig(estimator=estimator,
                              hyperparameter_sampling=param_sampling, 
                              policy=None,
                              primary_metric_name="rmse", 
                              primary_metric_goal=PrimaryMetricGoal.MINIMIZE,  # rmse should be minimized
                              max_total_runs=10,
                              max_concurrent_runs=4)
    
    run = exp.submit(hyperdrive_run_config)
    run.wait_for_completion(show_output=True)
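
    After the sweep finishes, Hyperdrive can hand back the best child run via the API. A minimal sketch of retrieving it and downloading its model (assuming train.py still logs "alpha_value" and uploads the pickle under the same "model_alpha_<alpha>.pkl" name as above):

    # The HyperDrive parent run knows which child did best on the primary metric
    best_run = run.get_best_run_by_primary_metric()
    print("Best run:", best_run.id, best_run.get_metrics())

    # Download the model file that train.py uploaded for that child run
    best_alpha = best_run.get_metrics()["alpha_value"]
    best_run.download_file(name="model_alpha_" + str(best_alpha) + ".pkl",
                           output_file_path="best_model.pkl")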
    
