After performing joins and aggregation, I want the output partitioned by a column, with exactly one file per partition. With repartition(1) the job takes 1 hour; if I remove repartition(1) it takes 30 minutes, but each partition then contains multiple files (see the example below). Is there a way to write the data as one file per partition?
...
...
df= df.repartition(1)
glueContext.write_dynamic_frame.from_options(
frame = df,
connection_type = "s3",
connection_options = {
"path": "s3://s3path",
"partitionKeys": ["choice"]
},
format = "csv",
transformation_ctx = "datasink2")
Is there any other way to improve the write performance? Does changing the format help? And how can I keep the write parallel while still getting one file per partition?
S3 storage example
**if repartition(1)** // what I want but takes more time
choice=0/part-00-001
..
..
choice=500/part-00-001
**if removed** // takes less time but multiple files are present
choice=0/part-00-001
....
choice=0/part-00-0032
..
..
choice=500/part-00-001
....
choice=500/part-00-0032
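Instead of df.repartition(1), repartition on the partition column with df.repartition("choice"). All rows with the same choice value then land in the same Spark partition, so each choice=... prefix should get a single file, while different choice values are still written in parallel by different tasks. repartition(1) is slow because it forces the whole dataset through one task.
# Repartitioning by column is a Spark DataFrame operation, so if df is a
# DynamicFrame (as in the question), convert it, repartition, and convert back.
from awsglue.dynamicframe import DynamicFrame
df = DynamicFrame.fromDF(df.toDF().repartition("choice"), glueContext, "df")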
glueContext.write_dynamic_frame.from_options(
frame = df,
connection_type = "s3",
connection_options = {
"path": "s3://s3path",
"partitionKeys": ["choice"]
},
format = "csv",
transformation_ctx = "datasink2")
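To see why repartitioning by the key gives one file per choice= prefix without losing parallelism, here is a small pure-Python simulation. It is only a sketch: the helper names (simulate_write, files_per_choice, busy_tasks) are made up for illustration, and Python's hash() stands in for Spark's own hash partitioner, which works differently internally but has the same property that equal keys go to the same partition.

```python
from collections import defaultdict

def simulate_write(rows, num_tasks, assign):
    """Map each choice value to the set of tasks that would write a file for it.

    Every task that holds at least one row for a choice value writes its own
    file under that value's choice=<value>/ prefix.
    """
    files = defaultdict(set)
    for i, choice in enumerate(rows):
        task = assign(i, choice) % num_tasks
        files[choice].add(task)
    return files

def files_per_choice(files):
    # Number of output files under each choice=<value>/ prefix
    return {choice: len(tasks) for choice, tasks in files.items()}

def busy_tasks(files):
    # Number of distinct tasks that write anything at all
    return len(set().union(*files.values()))

# Hypothetical data: 5 choice values, 100 rows each
rows = [c for c in range(5) for _ in range(100)]

# No repartition: rows for one choice are spread across all tasks
spread = simulate_write(rows, 4, lambda i, choice: i)
# repartition(1): everything goes through a single task
single = simulate_write(rows, 1, lambda i, choice: 0)
# repartition("choice"): the task is chosen from the key alone
by_key = simulate_write(rows, 4, lambda i, choice: hash(choice))

for name, result in [("spread", spread), ("single", single), ("by_key", by_key)]:
    print(name, files_per_choice(result), "tasks used:", busy_tasks(result))
```

In this simulation the key-based assignment gives one file per choice value, like repartition(1), but the work is shared by several tasks instead of one.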