amazon-web-services, amazon-s3, aws-lambda, parquet

Lambda + awswrangler: Poor performance while handling "large" parquet files


I'm currently writing a Lambda function to read parquet files of 100 MB to 200 MB on average using Python and the awswrangler library. The idea is to read the files and transform them to CSV:

import boto3
import urllib.parse
import awswrangler as wr
from io import StringIO

print('Loading function')

s3 = boto3.client('s3')
s3_resource = boto3.resource('s3')
dest_bucket = "mydestbucket"

def lambda_handler(event, context):
    # Get the object from the event and show its content type
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    try:
        response = s3.get_object(Bucket=bucket, Key=key)
        print("CONTENT TYPE: " + response['ContentType'])

        if key.endswith('.parquet'):
            # Read the parquet file in chunks so each chunk becomes its own DataFrame
            dfs = wr.s3.read_parquet(path=['s3://' + bucket + '/' + key], chunked=True, use_threads=True)

            count = 0
            for df in dfs:
                # Write each chunk to the destination bucket as a separate CSV file
                csv_buffer = StringIO()
                df.to_csv(csv_buffer)
                s3_resource.Object(dest_bucket, 'dfo_' + str(count) + '.csv').put(Body=csv_buffer.getvalue())
                count += 1

            return "File written"
    except Exception as e:
        print('Error processing object {} from bucket {}.'.format(key, bucket))
        raise e

The function works fine when I use small files, but once I try it with large files (100 MB) it times out.

I already allocated 3 GB of memory to the Lambda function and set a timeout of 10 minutes; however, that doesn't seem to do the trick.

Do you know how to improve the performance apart from allocating more memory?

Thanks!


Solution

  • I resolved the issue by creating a Lambda Layer that uses fastparquet, which handles memory more efficiently than awswrangler. A sketch of that approach is shown below.
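
The following is a minimal sketch of what the fastparquet-based handler can look like, assuming a Layer that bundles fastparquet and pandas. The bucket names and output key pattern are carried over from the question; downloading the object to /tmp before reading it is my own assumption, not part of the original answer.

import boto3
from io import StringIO
from fastparquet import ParquetFile

s3 = boto3.client('s3')
dest_bucket = "mydestbucket"

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']

    # Download to /tmp (512 MB by default), which fits a 100-200 MB parquet file
    local_path = '/tmp/input.parquet'
    s3.download_file(bucket, key, local_path)

    pf = ParquetFile(local_path)
    # iter_row_groups() yields one pandas DataFrame per row group, so only a
    # single row group is held in memory at a time
    for count, df in enumerate(pf.iter_row_groups()):
        csv_buffer = StringIO()
        df.to_csv(csv_buffer, index=False)
        s3.put_object(Bucket=dest_bucket,
                      Key='dfo_' + str(count) + '.csv',
                      Body=csv_buffer.getvalue())

    return "File written"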