amazon-web-services, amazon-s3, amazon-sagemaker, xgboost, catboost

More efficient way to stream data to AWS Batch Transform Job


I have a sagemaker pipeline for training a model and running inference on data:

  1. processing job: read input csv files from s3 and clean up the data, output csv files to s3
  2. processing job: read in the cleaned csv data from s3 and split the data for training, output csv files to s3
  3. training job: read in the X_train, y_train, X_test, y_test csv files from s3 and run training using catboost or xgboost, output trained parameters to s3
  4. batch transform job (inference): stream inference data from s3 to HTTP server within docker container using the previously trained parameters from s3 as the model, output scored data to s3

I wanted to switch the file type from csv to parquet, but the batch transform job doesn't mount the inference data from s3 onto the container. Instead, it forces you to run an HTTP server in the container, and the data is streamed from s3 to that server for processing. On top of that, the max payload size is set by sagemaker at 10 MB, I believe. I was able to get this working with csv files by setting SplitType = Line. My understanding is that I cannot use SplitType = Line with parquet files, since parquet is a binary columnar format rather than line-delimited text.

Does anyone have a method for streaming large amounts of data to a batch transform job that doesn't use csv as the file type? I would like to use parquet files, but the only solution I can think of is adding another processing job that splits the parquet files into pieces of 10 MB or less, so that the batch transform job can stream each piece as an individual parquet file.
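For illustration, the row ranges for such a splitting job could be planned roughly as below. This is only a sketch under assumptions: the helper name `plan_row_chunks`, the `safety` margin, and the 10 MB default are hypothetical, and the estimate uses the average on-disk bytes per row, which parquet compression can make uneven. Reading the rows and writing each piece would still need a parquet library, which is not shown here.

```python
def plan_row_chunks(num_rows, file_bytes,
                    max_payload_bytes=10 * 1024 * 1024, safety=0.8):
    """Return (start, stop) row ranges whose estimated size fits the payload limit.

    The estimate assumes every row takes the file's average number of bytes,
    so `safety` leaves headroom for rows that are larger than average.
    """
    if num_rows <= 0:
        return []
    if file_bytes <= 0:
        # Nothing to estimate from; keep all rows in a single chunk.
        return [(0, num_rows)]

    budget = int(max_payload_bytes * safety)
    # Integer arithmetic avoids floating point rounding at chunk boundaries.
    rows_per_chunk = max(1, (budget * num_rows) // file_bytes)

    return [(start, min(start + rows_per_chunk, num_rows))
            for start in range(0, num_rows, rows_per_chunk)]
```

Each `(start, stop)` pair would then be written out as its own parquet file. It is worth checking the actual size of each output file, since the per-row estimate is only an average.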

My previous process used csv files with the sagemaker SplitType = Line, so I didn't have to manage the size of the csv files myself. Attached is my config dict for the batch transform job, but I am mainly looking for options on how to approach this problem. What alternatives exist? The csv option works but is not ideal, and the parquet option doesn't work without splitting the parquet files into pieces of 10 MB or less. I am assuming there must be a straightforward way of streaming data from s3 to the sagemaker batch transform job without having to manually control file sizes for parquet and without having to use csv files, which introduce other problems.

inference_job_config = {
    'TransformJobName': inference_job_name,
    'ModelName': self.model_name,
    'TransformInput': {
        'DataSource': {
            'S3DataSource': {
                'S3Uri': f's3://{self.s3_bucket}/{self.s3_preprocessed_inference}/',
                'S3DataType': 'S3Prefix'
            }
        },
        'ContentType': 'text/csv',
        'CompressionType': 'None',
        'SplitType': 'Line'
    },
    'TransformOutput': {
        'S3OutputPath': f's3://{self.s3_bucket}/{self.s3_inference_results}/',
        'AssembleWith': 'None'
    },
    'TransformResources': {
        'InstanceType': self.instance_type,
        'InstanceCount': 1
    },
    'DataProcessing': {
        'JoinSource': 'None'
    }
}

I saw one post on here that showed how to accept parquet files on the HTTP server in the container running the batch transform job, but as far as I can see this still doesn't solve the file size issue. If the files are preprocessed beforehand to be 10 MB or less, then this works and I can do it, but I was hoping for something more straightforward, like the csv option but with a file type that is better than csv.


Solution

  • This isn't the exact answer I was looking for, but it is a solution to the problem: I swapped out the batch transform job for another processing job. Sagemaker processing jobs mount data from s3 into the container that you provide to run your python scripts, so the processing job can mount both the input data and the model parameters. This eliminates the need to stream your data to an HTTP server running in the container, as the batch transform job requires. I'm not sure whether there is some reason not to do this, since it seems considerably simpler than the batch transform job, which limited me to a 10 MB payload per request when streaming to the HTTP server and makes larger files hard to handle.
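As a rough sketch of what the replacement could look like, the function below builds a request dict for a processing job that mounts both the inference data and the trained model from s3. The field names follow the SageMaker CreateProcessingJob request as I understand it. The function name, input/output names, local paths, entrypoint script, default instance type, and volume size are all placeholders I made up, not values from my actual setup.

```python
def build_processing_job_config(job_name, image_uri, role_arn, bucket,
                                inference_prefix, model_prefix, results_prefix,
                                instance_type='ml.m5.xlarge'):
    """Build a CreateProcessingJob request that mounts data and model from s3."""

    def s3_input(name, prefix, local_path):
        # 'File' mode copies the s3 objects into the container before the
        # script starts, so no HTTP server or payload limit is involved.
        return {
            'InputName': name,
            'S3Input': {
                'S3Uri': f's3://{bucket}/{prefix}/',
                'LocalPath': local_path,
                'S3DataType': 'S3Prefix',
                'S3InputMode': 'File'
            }
        }

    return {
        'ProcessingJobName': job_name,
        'RoleArn': role_arn,
        'AppSpecification': {
            'ImageUri': image_uri,
            # Hypothetical script baked into the image; it reads the local
            # input paths below and writes scores to the output path.
            'ContainerEntrypoint': ['python3', '/opt/ml/code/inference.py']
        },
        'ProcessingInputs': [
            s3_input('inference-data', inference_prefix,
                     '/opt/ml/processing/input/data'),
            s3_input('model', model_prefix,
                     '/opt/ml/processing/input/model')
        ],
        'ProcessingOutputConfig': {
            'Outputs': [{
                'OutputName': 'scored-data',
                'S3Output': {
                    'S3Uri': f's3://{bucket}/{results_prefix}/',
                    'LocalPath': '/opt/ml/processing/output',
                    'S3UploadMode': 'EndOfJob'
                }
            }]
        },
        'ProcessingResources': {
            'ClusterConfig': {
                'InstanceCount': 1,
                'InstanceType': instance_type,
                'VolumeSizeInGB': 30  # assumed size; adjust to the data volume
            }
        }
    }
```

The resulting dict would be passed to the boto3 sagemaker client as `create_processing_job(**config)`. The script inside the container can then read the parquet files from the local input path with whatever library it likes and write the scored data to the output path, which is uploaded to s3 when the job ends.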