azure batch-processing azure-machine-learning-service azureml-python-sdk

How to output data from an Azure ML Batch Endpoint correctly using Python?

When invoking Azure ML Batch Endpoints (creating jobs for inferencing), the run() method should return a pandas DataFrame or an array, as explained here.

However, the example shown there does not produce output with CSV headers, which is often needed.

The first thing I tried was returning the data as a pandas DataFrame. The result was a plain CSV with a single column and no headers.

When I tried to pass values with several columns and their corresponding headers, to be saved later as CSV, the output contained awkward square brackets (from the Python lists) and apostrophes (from the strings).

I haven't been able to find documentation elsewhere on how to fix this.
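The brackets and apostrophes look like what Python produces when a list is converted with str(). A minimal sketch of that symptom, using made-up values (the column names and row below are only illustrative, not taken from a real job):

```python
columns = ["id", "prediction"]  # hypothetical column names
row = ["a1", 0.5]               # hypothetical row values

# str() on a list keeps Python's list syntax: brackets and quotes.
print(str(row))  # ['a1', 0.5]

# Joining the stringified values gives a plain comma-separated line.
header_line = ",".join(columns)
row_line = ",".join(str(value) for value in row)
print(header_line)  # id,prediction
print(row_line)     # a1,0.5
```

If each element of the returned list is written out as one line, a list of lists would end up looking like the first print, while a list of pre-joined strings would look like the last two.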

Solution

This is the way I found to produce clean CSV output in Python from a batch endpoint invocation in Azure ML:

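import logging

import pandas as pd

logger = logging.getLogger(__name__)

# `model` is a global that is loaded in init(), which is not shown here.

def run(mini_batch):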
    batch = []
    for file_path in mini_batch:
        df = pd.read_csv(file_path)
        
        # Do any data quality verification here:
        if 'id' not in df.columns:
            logger.error("ERROR: CSV file uploaded without id column")
            return None
        else:
            df['id'] = df['id'].astype(str)

        # Now create the predictions with the model previously loaded in init():
        df['prediction'] = model.predict(df)
        # or, alternatively: df[MULTILABEL_LIST] = model.predict(df)

        batch.append(df)

    batch_df = pd.concat(batch)

    # After joining all data, build the header line by joining the column
    # names with commas (no brackets, apostrophes, or padding spaces):
    azureml_columns = ','.join(str(col) for col in batch_df.columns)
    result = [azureml_columns]

    # Convert each row to one comma-separated string. Joining the values
    # directly avoids parsing NumPy's array repr, which breaks on values
    # that contain spaces and on wide rows that wrap across lines.
    # Note: values that themselves contain commas are not quoted here.
    for row in batch_df.itertuples(index=False, name=None):
        azureml_row = ','.join(str(value) for value in row)
        result.append(azureml_row)

    logger.info("Finished Run")
    return result
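
If some values can contain commas or quotes, joining with ',' will produce malformed lines. One possible variation, sketched here with the standard-library csv module, quotes such values automatically. The helper name to_csv_lines and the sample data are hypothetical and not part of the Azure ML SDK:

```python
import csv
import io


def to_csv_lines(header, rows):
    """Format a header and rows as CSV strings, one string per line."""
    lines = []
    for record in [header, *rows]:
        buffer = io.StringIO()
        # lineterminator="" keeps each string free of a trailing newline.
        csv.writer(buffer, lineterminator="").writerow(record)
        lines.append(buffer.getvalue())
    return lines


# Hypothetical sample data; the last value contains a comma.
lines = to_csv_lines(
    ["id", "prediction", "note"],
    [["a1", 0.5, "ok"], ["b2", 1.0, "x, y"]],
)
print(lines)
```

Inside run(), the header could be batch_df.columns.tolist() and the rows batch_df.itertuples(index=False, name=None), with the returned list used as the result. Values that contain line breaks would still span several output lines, so this sketch assumes single-line values.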