Tags: amazon-web-services, amazon-s3, pyspark, aws-glue

AWS Glue ETL: Reading a huge JSON file to process, but getting an OutOfMemory error


I am working on the AWS Glue ETL part that reads a huge JSON file (testing with only one file of around 9 GB) for the ETL process, but after running and processing for a while AWS Glue fails with java.lang.OutOfMemoryError: Java heap space.

My code and flow are as simple as:

df = spark.read.option("multiline", "true").json(f"s3/raw_path")
# ...
# and write it as source_df to another object in S3
df.write.json(f"s3/source_path", lineSep=",\n")

From the error/log, it seems the container failed and was terminated while reading this huge file. I have already tried upgrading the worker type to G.1X with a small number of worker nodes; however, I would like to find another solution that does not amount to vertical scaling, i.e. simply adding more resources.

I am quite new to this area and this service, so I want to keep cost and time as low as possible :-)

Thank you all in advance.


Solution

  • After looking into Glue and Spark, I found that to get the benefit of parallel processing across multiple executors, in my case I had to split the (large) file into multiple smaller files, and it worked! The smaller files are then distributed across multiple executors; a sketch of one way to do the split is shown below.
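
This is not the poster's code, only a minimal sketch of the splitting step, assuming the 9 GB file is a single top-level JSON array (which is why multiline mode was needed in the first place). It streams the array with the ijson library so the whole file never has to fit in memory, and writes fixed-size chunks of records as JSON Lines files. The chunk size, paths, and helper names below are made up for illustration.

import json
import ijson  # streaming JSON parser; reads the array item by item

CHUNK_SIZE = 50_000  # records per output file; tune for your data

def split_json_array(src_path: str, dst_dir: str) -> None:
    """Split one large JSON array file into many smaller JSON Lines files."""
    chunk, part = [], 0
    with open(src_path, "rb") as src:
        # ijson.items(..., "item") yields each element of the top-level array
        for record in ijson.items(src, "item"):
            chunk.append(record)
            if len(chunk) >= CHUNK_SIZE:
                _write_part(chunk, f"{dst_dir}/part-{part:05d}.json")
                chunk, part = [], part + 1
    if chunk:
        _write_part(chunk, f"{dst_dir}/part-{part:05d}.json")

def _write_part(records, path: str) -> None:
    # One JSON object per line; ijson may yield Decimal for numbers, so coerce to float
    with open(path, "w") as out:
        for r in records:
            out.write(json.dumps(r, default=float) + "\n")

Once the smaller part files are uploaded to S3 (for example with boto3 or the AWS CLI), the read can point at the folder of parts; because each part is JSON Lines, the multiline option is no longer needed and Spark can assign the files to different executors, e.g. (hypothetical path):

df = spark.read.json(f"s3/split_path")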