There are lots of questions addressing this with Python solutions, but I'm having trouble finding anything for Glue in Scala. I understand both run on Spark, but I get compilation errors when I try to adapt the Python-based solutions to Scala. I wanted to both ask the question and leave a simple reference for anyone else with the same issue.
Basically, I generate my output like this:
val datasource0 = DynamicFrame(data, glueContext).withName("datasource0").withTransformationContext("datasource0")
val datasink2 = glueContext.getSinkWithFormat(connectionType = "s3", options = JsonOptions(Map("path" -> "s3://sf_path")), format = "parquet", transformationContext = "datasink2").writeDynamicFrame(datasource0)
Spark being Spark, it generates multiple output files for this transformation. How can I modify my job to create only one output file?
You can call repartition on your Scala DynamicFrame. Each partition is written as its own file, so the number of partitions equals the number of output files. More information on that here.
Code example: val repartitionedDataSource0 = datasource0.repartition(1)
Then pass repartitionedDataSource0 to writeDynamicFrame instead of datasource0.
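For reference, here is a minimal sketch of how the repartition call could fit into the sink from the question. The object name SingleFileSink and the helper writeSingleParquetFile are hypothetical names made up for illustration; the sink options are copied from the question, and the code assumes it runs inside a Glue Scala job where a GlueContext and the DynamicFrame already exist.

```scala
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import com.amazonaws.services.glue.util.JsonOptions

// Hypothetical helper object, not part of the Glue library.
object SingleFileSink {

  // Collapses the frame to a single partition, then writes it as Parquet.
  // With one partition, Spark runs one write task, so one data file is produced.
  def writeSingleParquetFile(
      glueContext: GlueContext,
      frame: DynamicFrame,
      path: String
  ): Unit = {
    // repartition(1) shuffles all rows into a single partition.
    val singlePartition = frame.repartition(1)

    // Same sink configuration as in the question, applied to the repartitioned frame.
    glueContext
      .getSinkWithFormat(
        connectionType = "s3",
        options = JsonOptions(Map("path" -> path)),
        format = "parquet",
        transformationContext = "datasink2"
      )
      .writeDynamicFrame(singlePartition)
  }
}
```

In the job from the question this would be called as SingleFileSink.writeSingleParquetFile(glueContext, datasource0, "s3://sf_path"). Keep in mind that a single partition means a single task does all the writing, so this is likely to be slow or run out of memory on large datasets.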