azure batch-processing azure-machine-learning-service azureml-python-sdk

How to output data to Azure ML Batch Endpoint correctly using python?


When invoking Azure ML Batch Endpoints (creating jobs for inferencing), the run() method should return a pandas DataFrame or an array, as explained here.

However, the example shown there doesn't produce output with headers for a CSV, which is often needed.

The first thing I tried was returning the data as a pandas DataFrame; the result is just a simple CSV with a single column and no headers.

When I try to pass values with several columns and their corresponding headers, to be saved later as a CSV, the output contains awkward square brackets (representing Python lists) and apostrophes (representing strings).

I haven't been able to find documentation elsewhere on how to fix this.
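The symptom can be reproduced outside Azure ML with plain Python. This is only a sketch under the assumption that each returned element is written out using its string form; the sample DataFrame and its column names are hypothetical and not taken from the real job.

```python
import pandas as pd

# Hypothetical sample data standing in for a scored mini-batch.
df = pd.DataFrame({"id": ["a1", "b2"], "prediction": [0.5, 0.9]})

# Rows kept as Python lists: the text form of a list carries the
# square brackets and the apostrophes around strings.
rows_as_lists = df.values.tolist()
print(str(rows_as_lists[0]))   # ['a1', 0.5]

# Rows joined into plain strings first: no list syntax is left over.
rows_as_text = [",".join(map(str, row)) for row in rows_as_lists]
print(rows_as_text[0])         # a1,0.5
```

If that assumption holds, returning one pre-formatted string per row is what keeps the brackets and apostrophes out of the output file.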


Solution

  • This is the way I found to create clean CSV output in Python from a batch endpoint invocation in Azure ML:

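    import logging

    import pandas as pd

    logger = logging.getLogger(__name__)

    # `model` is assumed to be a global set in init(), which is not shown here.
    def run(mini_batch):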
        batch = []
        for file_path in mini_batch:
            df = pd.read_csv(file_path)
            
            # Do any data quality verification here:
            if 'id' not in df.columns:
                logger.error("ERROR: CSV file uploaded without id column")
                return None
            else:
                df['id'] = df['id'].astype(str)
    
            # Now create the predictions with the model previously loaded in init():
            df['prediction'] = model.predict(df)
            # or alternatively, df[MULTILABEL_LIST] = model.predict(df)
    
            batch.append(df)
    
        batch_df = pd.concat(batch)
    
        # After joining all data, build the header line as a single string;
        # joining the column names directly avoids square brackets and apostrophes:
        azureml_columns = ','.join(str(col) for col in batch_df.columns)
        result = [azureml_columns]
    
        # Now convert every row to a string, joining its values with commas.
        # Note: values that themselves contain commas are not quoted here.
        for row in batch_df.itertuples(index=False):
            azureml_row = ','.join(str(value) for value in row)
            result.append(azureml_row)
    
        logger.info("Finished Run")
        return result
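As a possible alternative to building each line by hand, pandas can format the rows itself. The sketch below is not the method used above; it swaps in `DataFrame.to_csv`, which returns a string when no path is given and quotes fields that contain commas. The helper name `dataframe_to_csv_lines` and the sample data are hypothetical, and it assumes no value contains an embedded line break, since the text is split on line boundaries.

```python
import pandas as pd


def dataframe_to_csv_lines(df):
    """Return the header line followed by one CSV-formatted line per row."""
    # With no path argument, to_csv returns the CSV text as a string.
    csv_text = df.to_csv(index=False)
    # One list element per line; assumes no field contains a line break.
    return csv_text.splitlines()


# Hypothetical data, including a value that contains a comma.
demo = pd.DataFrame({
    "id": ["a1", "b2"],
    "note": ["plain", "has, comma"],
    "prediction": [0, 1],
})

lines = dataframe_to_csv_lines(demo)
for line in lines:
    print(line)
```

Inside run(), the last lines could then become `return dataframe_to_csv_lines(batch_df)`, assuming the endpoint writes each returned string as one line of the output file.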