Search code examples
apache-sparkapache-spark-sqlaws-glueamazon-aurora

convert string with UTC offset to spark timestamp offset


I am trying to convert string coming from s3/csv files to aurora mysql using sparksql (spark cluster 3.0) using Glue

Existing string value from csv: 20231021134021+0100 Expected timestamp: 2023-10-21 14:40:21 (added offset hour to time)

from awsglue.dynamicframe import DynamicFrame


# Use Spark SQL to parse the date-time string and add the time zone offset
spark_sql = glueContext.sql("""
    SELECT
           DATE_ADD(
               to_timestamp(SUBSTRING(your_datetime_column, 1, 14), 'yyyyMMddHHmmss'),
               INTERVAL CAST(SUBSTRING(your_datetime_column, 15, 2) AS INT) HOURS
               + CAST(SUBSTRING(your_datetime_column, 17, 2) AS INT) MINUTES
           ) AS 
    FROM my_data
""")

result_df = spark.sql(spark_sql)

result_dyf = DynamicFrame.fromDF(result_df, glueContext, "result_dyf")

I am getting a parsing error, is cast not supported with INTERVAL? and if I'm just using till timestamp like below

select to_timestamp(SUBSTRING(your_datetime_column, 1, 14), 'yyyyMMddHHmmss') from my_data

or select to_timestamp(SUBSTRING(your_datetime_column, 1, 14), 'yyyyMMddHHmmss') + INTERVAL 1 MINUTE from my_data both works well but not when i use cast from extracting offset

any better ways to handle on how to also include offset value inside timestamp?


Solution

  • In Spark SQL, the INTERVAL keyword generally expects a constant or a column value that has an appropriate interval type. So the CAST as INT before using it with INTERVAL doesn't work directly in this case.

    What you need to do is:

    • Parse the date-time string into a timestamp.
    • Add the UTC offset to the parsed timestamp.

    I would change the sql to

    # SQL query to transform the data
    spark_sql = glueContext.sql("""
        SELECT
            from_utc_timestamp(
                to_timestamp(SUBSTRING(your_datetime_column, 1, 14), 
             'yyyyMMddHHmmss'),
                CONCAT('GMT', SUBSTRING(your_datetime_column, 15, 5))
            ) AS timestamp_with_offset
        FROM my_data
    """)
    

    to_timestamp() function converts the date-time part of the string into a timestamp.

    from_utc_timestamp() function applies the time zone offset to the timestamp.It expects the time zone in the format 'GMT+0100', which is created using string manipulation. So from_utc_timestamp function takes the timestamp as being in UTC and converts it to the local time of the timezone specified.

    This should correctly add the offset to the time.