Azure Machine Learning Studio designer - "create new version" unexpected when registering a data set


I am trying to register a data set from a Python script step in the Azure Machine Learning Studio designer. Here is my code:

import pandas as pd
from azureml.core import Workspace, Run, Dataset

def azureml_main(dataframe1 = None, dataframe2 = None):
    run = Run.get_context()
    ws = run.experiment.workspace
    ds = Dataset.from_pandas_dataframe(dataframe1)
    ds.register(workspace = ws,
                name = "data set name",
                description = "example description",
                create_new_version = True)
    return dataframe1, 

I get an error saying that "create_new_version" in the ds.register call is an unexpected keyword argument. However, this keyword appears in the documentation, and I need it to keep track of new versions of the file.

If I remove the argument, I get a different error: "Local data source path not supported for this operation", so it still does not work. Any help is appreciated. Thanks!


Solution

  • update

    sharing OP's solution here for easier discovery

    import pandas as pd
    from azureml.core import Workspace, Run, Dataset
    
    def azureml_main(dataframe1 = None, dataframe2 = None):
        run = Run.get_context()
        ws = run.experiment.workspace
        datastore = ws.get_default_datastore()
        ds = Dataset.Tabular.register_pandas_dataframe(
            dataframe1, datastore, 'data_set_name',
            description = 'data set description.')
        return dataframe1,
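
Since the goal was to keep track of versions, it may help to check which version ends up registered after the step runs. This is a minimal sketch, assuming azureml-core is installed and that `Dataset.get_by_name` with `version="latest"` returns the newest registration. The helper name `latest_registered_version` is hypothetical and not part of the SDK.

```python
def latest_registered_version(workspace, name):
    """Return the version number of the newest dataset registered under `name`."""
    # Imported lazily so the helper can be defined where azureml-core is not installed.
    from azureml.core import Dataset

    # version="latest" asks the workspace for the most recent registration.
    dataset = Dataset.get_by_name(workspace, name=name, version="latest")
    return dataset.version
```

Calling `latest_registered_version(ws, 'data_set_name')` before and after a pipeline run should show whether a new version was actually created.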
    

    original answer

    Sorry you're struggling. You're very close!

    A few things may be the culprit here.

    1. It looks like you're using the legacy Dataset API (Dataset.from_pandas_dataframe()), which has been deprecated. The object it returns is likely an old-style dataset whose register() does not accept create_new_version, which would explain the first error. I recommend trying Dataset.Tabular.register_pandas_dataframe() instead.
    2. This is more conjecture, but there might also be limitations on registering a dataset from inside an "Execute Python Script" (EPS) module:
      1. the workspace object might not have the right permissions
      2. register_pandas_dataframe might not work inside the EPS module; you might have better luck saving the dataframe to Parquet first, then calling Dataset.Tabular.from_parquet_files
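
The Parquet fallback from the last point could look roughly like the sketch below. It assumes azureml-core and a Parquet engine such as pyarrow are available in the EPS environment, and that the default datastore supports `upload_files` (newer SDK releases mark it as deprecated). The helper name `register_via_parquet` and the `registered/<name>` target folder are hypothetical, and this has not been verified inside the designer.

```python
import os
import tempfile


def register_via_parquet(dataframe, workspace, name, description=None):
    """Write a DataFrame to Parquet, upload it, and register it as a new dataset version."""
    # Imported lazily so the helper can be defined where azureml-core is not installed.
    from azureml.core import Dataset

    datastore = workspace.get_default_datastore()
    target_path = f"registered/{name}"  # hypothetical folder on the datastore

    with tempfile.TemporaryDirectory() as tmp_dir:
        local_file = os.path.join(tmp_dir, "data.parquet")
        dataframe.to_parquet(local_file, index=False)
        # Upload so the dataset points at the datastore rather than a local path.
        datastore.upload_files(
            files=[local_file],
            relative_root=tmp_dir,
            target_path=target_path,
            overwrite=True,
        )

    dataset = Dataset.Tabular.from_parquet_files(
        path=[(datastore, f"{target_path}/data.parquet")]
    )
    # TabularDataset.register accepts create_new_version, unlike the legacy API.
    return dataset.register(
        workspace=workspace,
        name=name,
        description=description,
        create_new_version=True,
    )


def azureml_main(dataframe1=None, dataframe2=None):
    from azureml.core import Run

    run = Run.get_context()
    ws = run.experiment.workspace
    register_via_parquet(dataframe1, ws, "data_set_name", "example description")
    return dataframe1,
```

Note that every call overwrites the same file, so older versions would end up pointing at the newest data. If earlier versions need to stay reproducible, write each run to its own folder instead.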

    Hopefully something works here!