Is there a way to update an existing Azure ML Dataset from a pandas DataFrame and register the result as a new version? The default Dataset is stored in a blob as a CSV file. How can we approach this?
Also, let's say we want to change which version counts as the latest.
Above we see that version 2 is the latest, but I want to change the latest to version 1 so that anyone who reads the Dataset gets version 1. I don't want to specify a version explicitly when retrieving it.
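Regarding your first question, here are two methods to update your Azure ML dataset with a new version using a CSV file stored in Blob Storage:
Method 1 (Python SDK v2, azure-ai-ml):
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential
# Assumes a workspace config.json is available; adjust the credential to your setup
ml_client = MLClient.from_config(credential=DefaultAzureCredential())
blob_url = 'https://sampleazurestorage.blob.core.windows.net/data/my-sample-data.csv'
my_dataset = Data(
    path=blob_url,
    type=AssetTypes.URI_FILE,  # a single CSV file; MLTABLE expects a folder containing an MLTable definition
    description="a description for your dataset",
    name="dataset_name",
    version='<new_version>'
)
ml_client.data.create_or_update(my_dataset)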
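The '<new_version>' placeholder has to be filled with a version string that is not already taken. One possible way to pick it, sketched below, is to list the existing versions and increment the highest numeric one. The helper names `next_version` and `register_next_version` are hypothetical (not part of the SDK), and the sketch assumes the data asset already exists and uses integer-like version strings.

```python
def next_version(existing_versions):
    """Return the next integer version as a string.

    Non-numeric version labels are ignored; an empty list yields "1".
    """
    numeric = [int(v) for v in existing_versions if str(v).isdigit()]
    return str(max(numeric, default=0) + 1)


def register_next_version(ml_client, name, path, description=None):
    """Hypothetical helper: register `path` as the next version of data asset `name`.

    Assumes `ml_client` is an azure.ai.ml.MLClient and that `name` already
    has at least one registered version.
    """
    # Imported lazily so that next_version() is usable without the SDK installed.
    from azure.ai.ml.constants import AssetTypes
    from azure.ai.ml.entities import Data

    # data.list(name=...) iterates over the registered versions of one asset.
    existing = [asset.version for asset in ml_client.data.list(name=name)]
    data = Data(
        path=path,
        type=AssetTypes.URI_FILE,
        name=name,
        description=description,
        version=next_version(existing),
    )
    return ml_client.data.create_or_update(data)
```

With versions 1 and 2 already registered, `next_version(["1", "2"])` gives "3", so calling `register_next_version(ml_client, "dataset_name", blob_url)` would register the file as version 3.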
Method 2 (Python SDK v1, azureml-core):
from azureml.core import Dataset, Workspace
ws = Workspace.from_config()
datastore = ws.get_default_datastore()
blob_url = 'https://sampleazurestorage.blob.core.windows.net/data/my-sample-data.csv'
# Delimited (CSV) files are read through Dataset.Tabular; Dataset.File has no from_delimited_files
my_dataset = Dataset.Tabular.from_delimited_files(path=blob_url)
my_dataset = my_dataset.register(
    workspace=ws,
    name="dataset_name",
    description="a description for your dataset",
    create_new_version=True  # register as a new version under the existing name
)
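If you want to update the dataset using a pandas DataFrame (SDK v1, reusing Dataset and datastore from the snippet above):
my_df = ...  # the variable that contains the new dataset in a DataFrame
# register_pandas_dataframe uploads the DataFrame to the datastore as Parquet and registers it;
# as far as I know, registering under an existing name adds a new version
my_dataset = Dataset.Tabular.register_pandas_dataframe(
    dataframe=my_df,
    target=(datastore, "dataset_name"),  # (datastore, relative path) to upload to
    name="dataset_name",
    description="a description for your dataset"
)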
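For the SDK v2 route, I am not aware of a direct DataFrame registration call, so one workable approach is to write the DataFrame to a local CSV file and register that file as a new version; `create_or_update` uploads local paths to the workspace's default datastore. This is only a sketch: `write_dataframe_csv`, `register_dataframe_version` and the `./_dataset_upload` folder are hypothetical names chosen for illustration.

```python
from pathlib import Path


def write_dataframe_csv(df, out_dir, file_name="data.csv"):
    """Write `df` to out_dir/file_name as CSV (no index column) and return the path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_path = out_dir / file_name
    df.to_csv(csv_path, index=False)  # standard pandas.DataFrame.to_csv call
    return csv_path


def register_dataframe_version(ml_client, df, name, version,
                               out_dir="./_dataset_upload"):
    """Hypothetical helper: register a DataFrame as `version` of data asset `name`.

    Assumes `ml_client` is an azure.ai.ml.MLClient.
    """
    # Imported lazily so that write_dataframe_csv() works without the SDK installed.
    from azure.ai.ml.constants import AssetTypes
    from azure.ai.ml.entities import Data

    csv_path = write_dataframe_csv(df, out_dir)
    data = Data(
        path=str(csv_path),        # local file; uploaded during create_or_update
        type=AssetTypes.URI_FILE,
        name=name,
        version=version,
    )
    return ml_client.data.create_or_update(data)
```

Usage would look like `register_dataframe_version(ml_client, my_df, "dataset_name", "3")`, where the version string is whatever new version you want to create.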
Regarding your second question:
"Above we see that version 2 is the latest, but I want to change the latest to version 1"
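It is not possible, since 'latest' always points to the most recently registered version of the dataset with the given name. So, if you want consumers to get a specific version, they have to request that version explicitly; and if you want 'latest' to resolve to different data, register that data as a new version by setting the version parameter of the Data class in the "Method 1" code snippet.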
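To make the 'latest' behaviour concrete, here is a toy in-memory model (not Azure's implementation, and `ToyDatasetRegistry` is a made-up class) in which 'latest' simply resolves to whatever was registered last. It shows the usual workaround: to "roll back", re-register the version 1 data as a new version 3.

```python
class ToyDatasetRegistry:
    """Toy stand-in for a versioned dataset registry; purely illustrative."""

    def __init__(self):
        # name -> list of (version, payload) tuples in registration order
        self._versions = {}

    def register(self, name, version, payload):
        self._versions.setdefault(name, []).append((version, payload))

    def get(self, name, version="latest"):
        entries = self._versions[name]
        if version == "latest":
            return entries[-1]  # the most recently registered entry
        for entry in entries:
            if entry[0] == version:
                return entry
        raise KeyError(f"{name} has no version {version}")


registry = ToyDatasetRegistry()
registry.register("dataset_name", "1", "old.csv")
registry.register("dataset_name", "2", "new.csv")

# 'latest' cannot be pointed back at version 1, but version 1's data
# can be registered again as version 3, which then becomes 'latest'.
_, old_payload = registry.get("dataset_name", "1")
registry.register("dataset_name", "3", old_payload)
```

In the real SDKs the explicit lookups would be, to my knowledge, `ml_client.data.get(name="dataset_name", version="1")` or `ml_client.data.get(name="dataset_name", label="latest")` in SDK v2, and `Dataset.get_by_name(ws, "dataset_name", version=1)` in SDK v1.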