I have a bit over 1200 JSON files in AWS S3 that I need to convert to Parquet and split into smaller files (I am preparing them for Redshift Spectrum). I have tried to create a Lambda function that does this for me per file, but the function either takes too long to complete or consumes too much memory, and therefore is killed before it finishes. The files are around 3-6 GB each.
Btw, I use Python.
I do not want to fire up an EC2 instance for this, since that takes forever to complete.
I would like some advice on how to accomplish this.
AWS Glue is useful for this kind of task. You can create a Glue job to convert the JSON data to Parquet format and save it to an S3 bucket of your choice. https://aws.amazon.com/blogs/big-data/build-a-data-lake-foundation-with-aws-glue-and-amazon-s3/
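A minimal sketch of what such a Glue job script could look like, assuming the source JSON sits under an example prefix like s3://your-source-bucket/json/ and you want the Parquet output under s3://your-target-bucket/parquet/ (both bucket names are placeholders, not your actual paths):

```python
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Standard Glue job boilerplate: the JOB_NAME argument is passed in by Glue
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read the JSON files straight from S3 (path is a placeholder)
source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://your-source-bucket/json/"]},
    format="json",
)

# Write out as Parquet; Spark splits the output into multiple part files,
# which is what Redshift Spectrum works best with (path is a placeholder)
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://your-target-bucket/parquet/"},
    format="parquet",
)

job.commit()
```

Because Spark distributes the work across the job's DPUs, the 3-6 GB files that overwhelm a single Lambda invocation are processed in parallel partitions, and the Parquet output comes out naturally split into multiple smaller files.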