python, pandas, azure, azure-machine-learning-service, mlops

How to Update an Azure ML Dataset with a New pandas DataFrame and How to Revert to a Specific Version if Needed


Is there a way to update an existing Azure ML Dataset from a pandas DataFrame and register the result as a new version? The current Dataset is stored in a blob as a CSV file. How can we approach this?

Also, let's say we want to make a different version the latest one.

[Screenshot: the dataset's versions, with version 2 shown as the latest]

Above we see that version 2 is the latest, but I want to make version 1 the latest so that anyone who reads the Dataset gets version 1. I don't want readers to have to specify a version explicitly to retrieve it.


Solution

  • Regarding your first question, here are two methods to register a new version of your Azure ML dataset from a CSV file stored in Blob Storage:

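    Method 1 (Python SDK v2, azure-ai-ml):

    from azure.ai.ml import MLClient
    from azure.ai.ml.entities import Data
    from azure.ai.ml.constants import AssetTypes
    from azure.identity import DefaultAzureCredential

    # Connect to the workspace described by a local config.json
    ml_client = MLClient.from_config(credential=DefaultAzureCredential())

    blob_url = 'https://sampleazurestorage.blob.core.windows.net/data/my-sample-data.csv'

    my_dataset = Data(
        path=blob_url,
        # A single CSV file is a URI_FILE asset; MLTABLE expects a folder containing an MLTable file
        type=AssetTypes.URI_FILE,
        description="a description for your dataset",
        name="dataset_name",
        version='<new_version>'
    )

    # Registers the asset; a version that does not exist yet is added as a new version
    ml_client.data.create_or_update(my_dataset)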
    

    Method 2 (Python SDK v1, azureml-core):

    from azureml.core import Dataset, Workspace

    ws = Workspace.from_config()
    datastore = ws.get_default_datastore()

    blob_url = 'https://sampleazurestorage.blob.core.windows.net/data/my-sample-data.csv'

    # Delimited (CSV) files are read through the Tabular factory, not Dataset.File
    my_dataset = Dataset.Tabular.from_delimited_files(path=blob_url)
    my_dataset.register(
        workspace=ws,
        name="dataset_name",
        description="a description for your dataset",
        create_new_version=True  # add a new version under the existing name
    )
    

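    If you want to update the dataset from a pandas DataFrame (SDK v1, reusing Dataset and datastore from Method 2):

    my_df = ...  # the variable that contains the new dataset in a DataFrame
    # Uploads the DataFrame to the datastore as Parquet and registers it in one call
    # (register_pandas_dataframe is marked experimental in azureml-core)
    my_dataset = Dataset.Tabular.register_pandas_dataframe(
        dataframe=my_df,
        target=datastore,
        name="dataset_name"
    )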
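
With the v2 SDK from Method 1, a DataFrame could be handled in a similar way: write it to a local CSV file and pass that local path to Data, since create_or_update uploads local files to the workspace's default datastore. The following is only a minimal sketch under stated assumptions: ml_client is an already authenticated MLClient, and the helper names dataframe_to_csv and register_dataframe as well as the file name data.csv are hypothetical and not part of the original answer.

```python
import tempfile
from pathlib import Path

import pandas as pd


def dataframe_to_csv(df: pd.DataFrame, directory: str, file_name: str = "data.csv") -> Path:
    """Write the DataFrame to <directory>/<file_name> without the index column."""
    csv_path = Path(directory) / file_name
    df.to_csv(csv_path, index=False)
    return csv_path


def register_dataframe(ml_client, df: pd.DataFrame, name: str, version: str):
    """Register df as version `version` of the data asset `name` (SDK v2).

    `ml_client` is assumed to be an authenticated azure.ai.ml.MLClient.
    """
    # Imported lazily so the CSV helper above works without the Azure SDK installed.
    from azure.ai.ml.constants import AssetTypes
    from azure.ai.ml.entities import Data

    with tempfile.TemporaryDirectory() as tmp_dir:
        csv_path = dataframe_to_csv(df, tmp_dir)
        asset = Data(
            path=str(csv_path),
            type=AssetTypes.URI_FILE,
            name=name,
            version=version,
        )
        # The local file is uploaded here, while the temporary directory still exists.
        return ml_client.data.create_or_update(asset)
```

A call might look like register_dataframe(ml_client, my_df, "dataset_name", "<new_version>"), with the version placeholder filled in as in Method 1.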
    

    Regarding your second question:

    Above we see that version 2 is the latest, but I want to change the latest to version 1

    That is not possible: 'latest' always points to the most recently registered version of the dataset with the given name. To work with a particular version instead, set the version parameter explicitly, as in the Data class in the "Method 1" code snippet.
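
To illustrate the difference between "latest" and an explicit version when reading with the v2 SDK, here is a small sketch. It assumes ml_client is an authenticated MLClient; the helper names get_data_asset and republish_as_latest are hypothetical. The second helper shows one possible workaround that the answer above does not spell out: registering the contents of an older version again under a new version number, so that 'latest' resolves to that data. Treat it as an untested assumption about your workspace rather than a recommended procedure.

```python
def get_data_asset(ml_client, name, version=None):
    """Return the given version of a data asset, or the one labelled 'latest'."""
    if version is not None:
        return ml_client.data.get(name=name, version=str(version))
    return ml_client.data.get(name=name, label="latest")


def republish_as_latest(ml_client, name, source_version, new_version):
    """Register the data behind `source_version` again as `new_version`.

    'latest' itself cannot be repointed; this only adds a newer version
    that refers to the same path as the older one.
    """
    # Imported lazily so get_data_asset stays usable without the Azure SDK installed.
    from azure.ai.ml.entities import Data

    source = ml_client.data.get(name=name, version=str(source_version))
    copy = Data(
        name=name,
        version=str(new_version),
        path=source.path,
        type=source.type,
        description=source.description,
    )
    return ml_client.data.create_or_update(copy)
```

For example, get_data_asset(ml_client, "dataset_name") reads whatever is currently latest, while get_data_asset(ml_client, "dataset_name", version=1) pins version 1.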