Tags: apache-spark, pyspark, aws-glue

AWS Glue write performance


After performing joins and aggregations, I want the output to be a single file per partition, partitioned on a column. When I use repartition(1) the job takes 1 hour; if I remove repartition(1) the job takes 30 minutes but produces multiple files per partition (see the example below). Is there a way to write the data into one file per partition without the slowdown?

...
...
df = df.repartition(1)
glueContext.write_dynamic_frame.from_options(
    frame = df,
    connection_type = "s3",
    connection_options = {
        "path": "s3://s3path",
        "partitionKeys": ["choice"]
        },
    format = "csv",
    transformation_ctx = "datasink2")

Is there any other way to increase the write performance? Does changing the format help? And how can I keep parallelism while still getting a single file per partition?

S3 storage example

**with repartition(1)** // what I want, but takes more time
choice=0/part-00-001
..
..
choice=500/part-00-001

**with repartition(1) removed** // takes less time, but multiple files are present per partition
choice=0/part-00-001
 ....
 choice=0/part-00-0032
..
..
choice=500/part-00-001
 ....
 choice=500/part-00-0032

Solution

  • Instead of df.repartition(1), use df.repartition("choice").

    Repartitioning by the partition column sends all rows with the same "choice" value to the same Spark partition, so the write produces exactly one file per choice=... prefix, while different values are still processed in parallel across executors. repartition(1), by contrast, forces the entire dataset through a single partition (and a single task), which is why it is so much slower.

    df = df.repartition("choice")
    glueContext.write_dynamic_frame.from_options(
        frame = df,
        connection_type = "s3",
        connection_options = {
            "path": "s3://s3path",
            "partitionKeys": ["choice"]
            },
        format = "csv",
        transformation_ctx = "datasink2")
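
To see why repartitioning by the column yields one file per partition value, here is a minimal plain-Python sketch (not the Glue/Spark API) of the hash-partitioning idea: each row is routed to a partition by hashing its "choice" value, so all rows sharing a value land in the same partition, while distinct values spread across partitions.

```python
# Illustrative sketch of hash partitioning by key, the mechanism behind
# df.repartition("choice"). NUM_PARTITIONS and the sample rows are made up.
NUM_PARTITIONS = 8

rows = [{"choice": c, "value": i} for i, c in enumerate([0, 1, 2, 0, 1, 2, 0])]

# Route each row to a partition by hashing its "choice" value.
partitions = [[] for _ in range(NUM_PARTITIONS)]
for row in rows:
    partitions[hash(row["choice"]) % NUM_PARTITIONS].append(row)

# Every distinct "choice" value ends up in exactly one partition,
# so the writer emits exactly one file under each choice=... prefix.
for c in {r["choice"] for r in rows}:
    holders = [i for i, p in enumerate(partitions)
               if any(r["choice"] == c for r in p)]
    assert len(holders) == 1
```

Note that a highly skewed "choice" column would still bottleneck on its largest value, since that value's rows all flow through one task.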