MLflow nested runs not grouping together in GUI

I'm learning MLflow in Databricks using the tutorial. The tutorial uses nested MLflow runs for hyperparameter optimization of XGBoost. A parent run is created via

with mlflow.start_run(run_name='xgboost_models'):
    best_params = fmin(

which invokes the model training process defined by

def train_model(params):
    with mlflow.start_run(nested=True):
        train = xgb.DMatrix(data=X_train, label=y_train)
        validation = xgb.DMatrix(data=X_val, label=y_val)
        # Additional training code here

The successful result is that on the Databricks default Experiments page (i.e., MLflow GUI pointing to default location), I see a run called xgboost_models that can be expanded to show a list of child runs where actual ML training was performed. The parent-child grouping as instructed by mlflow.start_run(nested=True) came out nicely.

Trouble comes when I decide that my runs should be logged to an Experiments location that I choose myself, instead of the default location in Databricks. First, I create the new location:


# Look up the experiment by name; get_experiment_by_name returns None if it does not exist
experiment = mlflow.get_experiment_by_name(EXPERIMENT_NAME)

if experiment is None:
    # If the experiment does not exist, create it (create_experiment returns the new ID)
    experiment_id = mlflow.create_experiment(EXPERIMENT_NAME)
else:
    # If the experiment exists, get its ID
    experiment_id = experiment.experiment_id

This goes well, in the sense that if I execute a single unnested ML run via with mlflow.start_run(experiment_id=experiment_id, run_name='untuned_random_forest'), the new log for run untuned_random_forest shows up on the dxxxx_minimal_MLflow Experiments page.

It really gets weird when I try this with the hyperopt nested runs. If I modify the outer call to read

with mlflow.start_run(experiment_id=experiment_id, run_name='xgboost_models_2'):
    best_params = fmin(

and change nothing else, my new parent run xgboost_models_2 shows up on the dxxxx_minimal_MLflow experiment page with no children. All the child runs show up back on the default experiment page with no parent, which is pretty hideous!

Checking on the detail, it may be important to note that the child runs do have a Parent ID tag, and its value seems to be set correctly to point to the ID corresponding to the xgboost_models_2 parent run. This leads me to suspect that the nested argument to mlflow.start_run(nested=True) is doing its job well, and somehow the GUI is simply failing to interpret the parent-child relationship correctly.


  1. Anyone got a fix?
  2. Anyone able to clue me in about whether this is a general MLflow problem or just a Databricks problem?

Footnote: I've tried to fix this by shoving additional parameters into the child invocations of mlflow.start_run(), such as experiment_id and parent_run_id, but that seems to make no difference. And that seems very reasonable, because as I noted above, the child runs seem to be correctly tagged with the Parent Run ID in the first place.


  • So, a solution.

    By logging some extra parameters from the child runs, I determined that my MLflow environment (by whose fault, I can't say) creates the child runs with a different experiment_id parameter value than that of the parent run, seemingly in total defiance of nested=True and in utter disregard for any parameters like experiment_id or parent_run_id that I might pass into the child invocation of mlflow.start_run().

    However, we can set experiment_id globally at the point where we initially created/obtained the desired experiment_id in the first place. I mean, the block that sets and uses EXPERIMENT_NAME. Just add the following line to the end of that block:


    (But still, the failure of nested=True doesn't seem like a very nice thing.)