azure-blob-storage, azure-machine-learning-service

Can we append data to an existing CSV file stored in Azure Blob Storage through Python?


I have a machine learning model deployed in Azure Designer Studio. I need to retrain it every day with new data through Python code. I need to keep the existing CSV data in blob storage, add more data to that CSV, and retrain the model. If I retrain with only the new data, the old data is lost, so I need to retrain by appending the new data to the existing data. Is there any way to do this through Python code?

I have also researched append blobs, but they only add data at the end of the blob. The documentation mentions that an existing blob cannot be updated or added to.
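
For reference, a minimal sketch of that append-blob behaviour (assuming the azure-storage-blob v12 SDK; the connection string, container, and blob names are placeholders):

    from azure.storage.blob import BlobServiceClient

    conn_str = "<your-storage-connection-string>"  # placeholder
    service = BlobServiceClient.from_connection_string(conn_str)
    blob = service.get_blob_client(container="training-data", blob="data.csv")

    # an append blob only supports adding new blocks at its end
    if not blob.exists():
        blob.create_append_blob()
    blob.append_block("0.5,1.2,label\n")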


Solution

  • I'm not sure why it has to be one CSV file. There are many Python-based libraries for working with a dataset spread across multiple CSVs.

    In all of these examples, you pass a glob pattern that matches multiple files. This pattern works very naturally with an Azure ML Dataset, which you can use as your input. See this excerpt from the Azure ML Dataset docs:

    from azureml.core import Workspace, Datastore, Dataset
    
    datastore_name = 'your datastore name'
    
    # get existing workspace
    workspace = Workspace.from_config()
        
    # retrieve an existing datastore in the workspace by name
    datastore = Datastore.get(workspace, datastore_name)
    
    # create a TabularDataset from 3 file paths in datastore
    datastore_paths = [(datastore, 'weather/2018/11.csv'),
                       (datastore, 'weather/2018/12.csv'),
                       (datastore, 'weather/2019/*.csv')] # here's the glob pattern
    
    weather_ds = Dataset.Tabular.from_delimited_files(path=datastore_paths)
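
    To fetch that dataset by name later (as the next snippet does), it first needs to be registered. A one-line sketch, where the dataset name is a placeholder:

    # register the dataset so it can later be retrieved with Dataset.get_by_name
    weather_ds = weather_ds.register(workspace=workspace, name='weather-training-data')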
    

    Assuming that all the CSVs can fit into memory, you can easily turn these datasets into pandas DataFrames. With Azure ML Datasets, you call:

    # get the registered input dataset by name
    dataset = Dataset.get_by_name(workspace, name=dataset_name)
    # load the TabularDataset into a pandas DataFrame
    df = dataset.to_pandas_dataframe()
    

    With a Dask DataFrame (as noted in a GitHub issue on the Dask repo), you can call

    # materialize the lazy Dask DataFrame into an in-memory pandas DataFrame
    df = my_dask_df.compute()
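
    where my_dask_df would itself typically be built from the same kind of glob pattern; a minimal sketch, with a placeholder path:

    import dask.dataframe as dd

    # dd.read_csv accepts a glob and lazily reads all matching CSV files
    my_dask_df = dd.read_csv('weather/2019/*.csv')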
    

    As for the output dataset, you can control this by reading the output CSV into a DataFrame, appending the new data to it, and then overwriting it at the same location.
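
    A minimal sketch of that read-append-overwrite loop, assuming the azure-storage-blob v12 SDK (the connection string, container, blob name, and new rows below are placeholders):

    import io
    import pandas as pd
    from azure.storage.blob import BlobServiceClient

    conn_str = "<your-storage-connection-string>"  # placeholder
    service = BlobServiceClient.from_connection_string(conn_str)
    blob = service.get_blob_client(container="training-data", blob="train.csv")

    new_rows = pd.DataFrame({"feature": [0.7], "label": [1]})  # new data to append

    # read the existing CSV, append the new rows, then overwrite in place
    existing = pd.read_csv(io.BytesIO(blob.download_blob().readall()))
    combined = pd.concat([existing, new_rows], ignore_index=True)
    blob.upload_blob(combined.to_csv(index=False), overwrite=True)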