Tags: scala, pyspark, data-science, etl, aws-glue

AWS Glue Scala, output one file with partitions


There are lots of questions addressing this with Python solutions, but I'm having trouble finding anything for Glue Scala. I understand both ultimately run on Spark, but I get compilation errors when I try to adapt the Python-based solutions to Scala. I wanted to both ask the question and leave a simple reference for anyone else with the same issue.

Basically I generate my output like this:

val datasource0 = DynamicFrame(data, glueContext).withName("datasource0").withTransformationContext("datasource0")
val datasink2 = glueContext.getSinkWithFormat(connectionType = "s3", options = JsonOptions(Map("path" -> "s3://sf_path")),
  format = "parquet", transformationContext = "datasink2").writeDynamicFrame(datasource0)

Spark being Spark, this write generates multiple output files. How can I modify my job so it creates only one output file?


Solution

  • You can call repartition on your Scala DynamicFrame before writing it. The number of partitions determines the number of output files, so repartitioning to a single partition yields a single file. See the AWS Glue DynamicFrame API documentation for more information.

    Code example: val repartitionedDatasource0 = datasource0.repartition(1) (a fuller sketch follows below)
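
    For reference, a minimal sketch of how the repartition call fits into the job from the question. It reuses the question's data value, glueContext, variable names, and the s3://sf_path placeholder path; only the repartition step is new, and the surrounding job setup is assumed to exist already.

    import com.amazonaws.services.glue.DynamicFrame
    import com.amazonaws.services.glue.util.JsonOptions

    // Same DynamicFrame as in the question.
    val datasource0 = DynamicFrame(data, glueContext)
      .withName("datasource0")
      .withTransformationContext("datasource0")

    // The sink writes one file per partition, so collapsing to a single
    // partition produces exactly one output file under s3://sf_path.
    val repartitionedDatasource0 = datasource0.repartition(1)

    val datasink2 = glueContext.getSinkWithFormat(
      connectionType = "s3",
      options = JsonOptions(Map("path" -> "s3://sf_path")),
      format = "parquet",
      transformationContext = "datasink2"
    ).writeDynamicFrame(repartitionedDatasource0)

    Keep in mind that forcing everything into one partition funnels the whole dataset through a single task, so it can be slow or memory-heavy for large outputs.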