Search code examples
scalaaws-glueaws-glue-spark

How would chaning the read in AWS Glue change a column's data type?


I have a AWS Glue job that was slightly modified, only the read was changed, the job runs fine however the datatypes on my columns have changed. Where I previously had BigInt, I now just have Ints. This is causing an EMR Job dependent on these files to error out due to the schema mismatch. I'm not sure what would cause this issue since the mapping did not change, so if anyone has insight that would be great here is the old & new code:

///OLD
val inputsourceDF = spark.read.format("json").load(inputFilePath)
val inputsource = DynamicFrame(inputsourceDF, glueContext)
///NEW
val inputsource = glueContext.getSourceWithFormat(connectionType = "s3",  options = JsonOptions(Map("paths" -> Set(inputFilePath))), format = "json", transformationContext = "inputsource").getDynamicFrame()

///WRITE which did not change
val inputsink = glueContext.getSinkWithFormat(connectionType = "s3", options = JsonOptions(s"""{"path": "$inputOutputFilePath"}"""), transformationContext = "inputdatasink", format = "parquet").writeDynamicFrame(inputdropnullfields.coalesce(inputPartitionCount))

And these are the tables created when crawling the files after the glue job

CREATE EXTERNAL TABLE `input_new`(`id` int)

CREATE EXTERNAL TABLE `input_old`(`id` bigint)

We added this change so that we could utilize Bookmarks, any help would be appreciated.


Solution

  • Both spark DataFrame and glue DynamicFrame infer the schema when reading data from json, but evidently, they do it differently: sparks treats all numerical values as bigint, while glue is trying to be clever, and (I guess) looks at the actual range of values on the fly.

    Some more info about DynamicFrame schema inference can be found here.

    If you are going to write parquet in the end anyway, and want the schema stable and consistent, I'd say your easiest way around this is to just revert your change and go back to spark DataFrame. You can also use apply_mapping to change the types explicitly after reading the data, but it seems like defeating the purpose of having the dynamic frame in the first place.