I am looking to ingest data from a source into S3 using AWS Glue.

Is it possible to compress the ingested data in Glue to a specified size? For example: compress the data to 500 MB, and also partition the data based on the compression value provided? If yes, how do I enable this? I am writing the Glue script in Python.
Compression and grouping are related but distinct. Compression happens when you write Parquet output. Grouping is controlled by 'groupSize': '31457280' (30 MB), which sets the size of each group read into the DynamicFrame and, roughly, the size of each output file (most files land near that size; the last file holds the remainder).

You also need to be careful with, and can leverage, the Glue worker type and capacity, e.g. Maximum capacity 10 with Worker type Standard. G.2X workers tend to create too many small files (it all depends on your situation and inputs). If you do nothing but read many small files and write them back unchanged with grouping enabled, they will be combined into files of roughly the groupSize. If you want a drastic reduction in the size of the files written, format the output as Parquet.

glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    format="json",
    connection_options={
        "paths": ["s3://yourbucketname/folder_name/2021/01/"],
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "31457280",
    },
)
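Since the question asks about 500 MB specifically: groupSize is expressed in bytes and passed as a string, so you can compute it from a target size in MB. A minimal sketch of building the connection options (the bucket and path are placeholders, and the helper name is my own, not a Glue API):

```python
def group_size_bytes(mb: int) -> str:
    # Glue expects groupSize as a string count of bytes.
    return str(mb * 1024 * 1024)

# Connection options for reading many small JSON files as ~500 MB groups.
# "yourbucketname/folder_name" is a placeholder path.
connection_options = {
    "paths": ["s3://yourbucketname/folder_name/2021/01/"],
    "recurse": True,
    "groupFiles": "inPartition",
    "groupSize": group_size_bytes(500),  # "524288000"
}
```

You would then pass this dict to glueContext.create_dynamic_frame_from_options(connection_type="s3", format="json", connection_options=connection_options) inside the Glue job, and write the result out with format="parquet" to get the actual compression.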