parquet, aws-glue

AWS Glue - Create a Single Parquet File


My data source generates hourly files in CSV format, which are pushed to S3. Using Glue, I run some ETL and push the transformed data back to S3. Another department that consumes this data wants yesterday's files consolidated into a single file. I have written a Python program that consolidates yesterday's 24 files into a single CSV file. The consolidated file now also needs to be available in Parquet.
I created a crawler to generate a table for my CSV data, and I have a Glue job that converts the single transformed file to Parquet. However, the job produces multiple Parquet part files, which I believe is because of the Snappy compression. I want to create a single file instead. How can I do this in Glue?
Secondly, I would like to understand when to use multiple Parquet files and when it makes sense to create a single one.


Solution

  • You can convert the Glue DynamicFrame to a Spark DataFrame with toDF(), call repartition(1) on it, and then call write.
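
The steps in this answer can be sketched roughly as follows. This is a minimal sketch rather than a definitive Glue job: the helper name write_single_parquet and the output path are hypothetical, and the function assumes it receives an awsglue DynamicFrame (or any object whose toDF() returns a Spark DataFrame). Spark writes one part file per partition of the DataFrame, so collapsing to a single partition before writing is what yields a single Parquet part file. The compression codec does not control the number of files.

```python
def write_single_parquet(dynamic_frame, output_path):
    """Write a Glue DynamicFrame to output_path as a single Parquet part file.

    Assumptions (not from the original post): dynamic_frame is an awsglue
    DynamicFrame, and output_path is a hypothetical S3 prefix such as
    "s3://example-bucket/consolidated/".
    """
    # Break out from the Glue DynamicFrame to a plain Spark DataFrame.
    df = dynamic_frame.toDF()

    # Spark emits one part file per partition, so collapse to one partition.
    single = df.repartition(1)

    # Overwrite any previous output; snappy is set explicitly for clarity.
    (single.write
        .mode("overwrite")
        .option("compression", "snappy")
        .parquet(output_path))

    return single
```

Inside a Glue job this would be called with the DynamicFrame produced by the ETL step, for example write_single_parquet(transformed, "s3://example-bucket/consolidated/"), where both names are placeholders. The result is still written under the given prefix as one part-*.parquet object rather than a file with a name of your choosing. Note that repartition(1) funnels all rows through a single task, which is reasonable for one day of consolidated data but may be slow for very large datasets.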