ScriptRunConfig with datastore reference on AML


When I try to submit a run with a ScriptRunConfig like this:

src = ScriptRunConfig(source_directory=project_folder, 
                      script='train.py', 
                      arguments=['--input-data-dir', ds.as_mount(),
                                 '--reg', '0.99'],
                      run_config=run_config) 
run = experiment.submit(config=src)

The submission fails with the following error:

... lots of things... and then
TypeError: Object of type 'DataReference' is not JSON serializable

However, if I run it with the Estimator, it works. One difference is that ScriptRunConfig takes its script arguments as a list, while the Estimator takes them as a dictionary.
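
The error message itself is the standard one raised by Python's json module when it meets an object it cannot encode. The sketch below reproduces that mechanism with the standard library only. `FakeDataReference` and its `$REF_` token format are hypothetical stand-ins invented for illustration, not the real azureml class or its actual string form, and the assumption that the arguments list is serialized to JSON on submit is inferred from the traceback rather than confirmed.

```python
import json


class FakeDataReference:
    """Hypothetical stand-in for an SDK object; NOT the real azureml class."""

    def __init__(self, name):
        self.name = name

    def __str__(self):
        # Placeholder token format chosen for this sketch only.
        return "$REF_" + self.name


def try_serialize(arguments):
    """Return (True, json_text) on success or (False, error_message) on failure."""
    try:
        return True, json.dumps(arguments)
    except TypeError as exc:
        return False, str(exc)


ref = FakeDataReference("mydatastore")

# A raw object in the list cannot be encoded, similar to the traceback above.
failed = try_serialize(["--input-data-dir", ref, "--reg", "0.99"])

# Converting the object to a string first makes the whole list serializable.
passed = try_serialize(["--input-data-dir", str(ref), "--reg", "0.99"])

print(failed)
print(passed)
```

This only illustrates why a non-string object in the arguments list can break serialization; it does not show how the SDK resolves a data reference at run time.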

Thanks for any pointers!


Solution

  • Using a DataReference in ScriptRunConfig is more involved than just calling ds.as_mount(). You need to convert it to a string in arguments and then update the RunConfiguration's data_references section with a DataReferenceConfiguration created from ds. There is an example notebook that shows how to do that.

    If you are only reading from the input location and not writing to it, consider using a Dataset instead. It lets you do exactly what you are doing with no extra steps. There is an example notebook that shows this in action.

    Below is a short version of that notebook:

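    from azureml.core import Dataset, Datastore, ScriptRunConfig

    # more imports and code

    # Look up a datastore that is already registered in the workspace.
    ds = Datastore(workspace, 'mydatastore')
    # Build a file dataset that points at a path inside that datastore.
    dataset = Dataset.File.from_files(path=(ds, 'path/to/input-data/within-datastore'))

    # The mounted dataset can be passed directly in the arguments list.
    src = ScriptRunConfig(source_directory=project_folder,
                          script='train.py',
                          arguments=['--input-data-dir', dataset.as_named_input('input').as_mount(),
                                     '--reg', '0.99'],
                          run_config=run_config)
    run = experiment.submit(config=src)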
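
The original post does not show train.py, so the following is only a sketch of how the script side might read the two arguments passed above. It assumes the mounted dataset reaches the script as an ordinary path string. The `parse_args` helper, the default value for `--reg`, and the example path are all hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the two arguments that the ScriptRunConfig above passes to train.py."""
    parser = argparse.ArgumentParser(description="Hypothetical train.py argument parsing")
    # Assumed to arrive as a plain directory path once the dataset is mounted.
    parser.add_argument("--input-data-dir", type=str, required=True,
                        help="directory where the input data is available")
    # Passed as the string '0.99' on the command line and converted to float here.
    # The default is an arbitrary value chosen for this sketch.
    parser.add_argument("--reg", type=float, default=0.01,
                        help="regularization rate")
    return parser.parse_args(argv)


# Example invocation with an explicit argument list (the path is made up).
args = parse_args(["--input-data-dir", "/tmp/example-mount", "--reg", "0.99"])
print(args.input_data_dir, args.reg)
```

In a real train.py you would call `parse_args()` with no argument so that argparse reads `sys.argv`, and then open files under `args.input_data_dir` as you would in any local directory.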