Search code examples
aws-glueaws-cdk

Glue Fails to Convert Date


Glue fails to convert a date in the format "yyyy/MM/dd" to "yyyy-MM-dd" when reading the data from a Glue DynamicFrame and writing it to an RDS Postgresql instance. The field is detected as type string in the Glue catalog, and the database type is date.

The DynamicFrame is created using:

df = glueContext.create_dynamic_frame.from_catalog(database=glue_db, table_name=glue_tbl_name, push_down_predicate=pushdown_predicate, transformation_ctx="df")

It is written to the database using:

output = glueContext.write_dynamic_frame.from_jdbc_conf(frame = df, catalog_connection = db_connection, 
        connection_options = {
            "database" : target_db, 
            "dbtable" : staging_table
            })

Fields containing dates in format "yyyy/MM/dd" are null in the database. Fields containing dates in format "yyyy-mm-dd" are set. I verified that the source date is not null by querying the catalogued S3 File using Athena.

I attempted to reformat the dates to "yyyy-MM-dd" by converting the DynamicFrame to a Spark Data Frame, and converting the column, but the column is still set to null. Here is the conversion code:

from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import functions as F

def convert_date_fields(map, df, glueContext):
    sf = df.toDF()
    for mapping in map:
        logger.info(f'Check if {mapping[3]} is equal to date.')
        if mapping[3] == "date":
            logger.info(f'Converting {mapping[0]} to yyyy-mm-dd')
            sf = sf.withColumn(mapping[0], F.to_date(mapping[0], 'yyyy-MM-dd'))

    df = DynamicFrame.fromDF(sf, glueContext, "df")
    return df

Solution

  • The following properly converts the field into the right format:

    sf = sf.withColumn(mapping[0], F.date_format(F.unix_timestamp(mapping[0], "yyyy/MM/dd").cast("timestamp"), "yyyy-MM-dd"))