azure-blob-storage, azure-machine-learning-service

Can we append data to an existing CSV file stored in Azure Blob Storage through Python?


I have a machine learning model deployed in Azure Designer Studio. I need to retrain it every day with new data through Python code. I need to keep the existing CSV data in blob storage, add more data to that CSV, and retrain the model. If I retrain with only the new data, the old data is lost, so I need to retrain by appending the new data to the existing data. Is there any way to do this through Python code?

I have also researched append blobs, but they only add data at the end of the blob. The documentation mentions that an existing blob cannot be updated or added to.
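
For reference, a minimal sketch of that append-blob behaviour (assuming the azure-storage-blob v12 SDK; the connection string, container, and blob names are placeholders):

    from azure.storage.blob import BlobServiceClient

    conn_str = "<your-storage-connection-string>"  # placeholder
    service = BlobServiceClient.from_connection_string(conn_str)
    blob = service.get_blob_client(container="training-data", blob="data.csv")

    # an append blob only supports adding new blocks at its end
    if not blob.exists():
        blob.create_append_blob()
    blob.append_block("0.5,1.2,label\n")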


Solution

  • I'm not sure why it has to be one CSV file. There are many Python-based libraries for working with a dataset spread across multiple CSVs.

    In all of these examples, you pass a glob pattern that matches multiple files. This pattern works very naturally with an Azure ML Dataset, which you can use as your input. See this excerpt from the Azure ML Dataset docs:

    from azureml.core import Workspace, Datastore, Dataset
    
    datastore_name = 'your datastore name'
    
    # get existing workspace
    workspace = Workspace.from_config()
        
    # retrieve an existing datastore in the workspace by name
    datastore = Datastore.get(workspace, datastore_name)
    
    # create a TabularDataset from 3 file paths in datastore
    datastore_paths = [(datastore, 'weather/2018/11.csv'),
                       (datastore, 'weather/2018/12.csv'),
                       (datastore, 'weather/2019/*.csv')] # here's the glob pattern
    
    weather_ds = Dataset.Tabular.from_delimited_files(path=datastore_paths)
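
    To fetch that dataset by name later (as the next snippet does), it first needs to be registered. A one-line sketch, where the dataset name is a placeholder:

    # register the dataset so it can later be retrieved with Dataset.get_by_name
    weather_ds = weather_ds.register(workspace=workspace, name='weather-training-data')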
    

    Assuming that all the CSVs can fit into memory, you can easily turn these datasets into pandas DataFrames. With Azure ML Datasets, you call:

    # get the registered input dataset by name
    dataset = Dataset.get_by_name(workspace, name=dataset_name)
    # load the TabularDataset into a pandas DataFrame
    df = dataset.to_pandas_dataframe()
    

    With a Dask DataFrame (as noted in a GitHub issue on the Dask repo), you can call

    # materialize the lazy Dask DataFrame into an in-memory pandas DataFrame
    df = my_dask_df.compute()
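
    where my_dask_df would itself typically be built from the same kind of glob pattern; a minimal sketch, with a placeholder path:

    import dask.dataframe as dd

    # dd.read_csv accepts a glob and lazily reads all matching CSV files
    my_dask_df = dd.read_csv('weather/2019/*.csv')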
    

    As for the output dataset, you can control this by reading the output CSV into a DataFrame, appending the new data to it, and then overwriting it at the same location.
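
    A minimal sketch of that read-append-overwrite loop, assuming the azure-storage-blob v12 SDK (the connection string, container, blob name, and new rows below are placeholders):

    import io
    import pandas as pd
    from azure.storage.blob import BlobServiceClient

    conn_str = "<your-storage-connection-string>"  # placeholder
    service = BlobServiceClient.from_connection_string(conn_str)
    blob = service.get_blob_client(container="training-data", blob="train.csv")

    new_rows = pd.DataFrame({"feature": [0.7], "label": [1]})  # new data to append

    # read the existing CSV, append the new rows, then overwrite in place
    existing = pd.read_csv(io.BytesIO(blob.download_blob().readall()))
    combined = pd.concat([existing, new_rows], ignore_index=True)
    blob.upload_blob(combined.to_csv(index=False), overwrite=True)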