I am trying to convert string coming from s3/csv files to aurora mysql using sparksql (spark cluster 3.0) using Glue
Existing string value from csv: 20231021134021+0100 Expected timestamp: 2023-10-21 14:40:21 (added offset hour to time)
from awsglue.dynamicframe import DynamicFrame
# Use Spark SQL to parse the date-time string and add the time zone offset
spark_sql = glueContext.sql("""
SELECT
DATE_ADD(
to_timestamp(SUBSTRING(your_datetime_column, 1, 14), 'yyyyMMddHHmmss'),
INTERVAL CAST(SUBSTRING(your_datetime_column, 15, 2) AS INT) HOURS
+ CAST(SUBSTRING(your_datetime_column, 17, 2) AS INT) MINUTES
) AS
FROM my_data
""")
result_df = spark.sql(spark_sql)
result_dyf = DynamicFrame.fromDF(result_df, glueContext, "result_dyf")
I am getting a parsing error, is cast not supported with INTERVAL? and if I'm just using till timestamp like below
select to_timestamp(SUBSTRING(your_datetime_column, 1, 14), 'yyyyMMddHHmmss') from my_data
or select to_timestamp(SUBSTRING(your_datetime_column, 1, 14), 'yyyyMMddHHmmss') + INTERVAL 1 MINUTE from my_data both works well but not when i use cast from extracting offset
any better ways to handle on how to also include offset value inside timestamp?
In Spark SQL, the INTERVAL keyword generally expects a constant or a column value that has an appropriate interval type. So the CAST as INT before using it with INTERVAL doesn't work directly in this case.
What you need to do is:
I would change the sql to
# SQL query to transform the data
spark_sql = glueContext.sql("""
SELECT
from_utc_timestamp(
to_timestamp(SUBSTRING(your_datetime_column, 1, 14),
'yyyyMMddHHmmss'),
CONCAT('GMT', SUBSTRING(your_datetime_column, 15, 5))
) AS timestamp_with_offset
FROM my_data
""")
to_timestamp()
function converts the date-time part of the string into a timestamp.
from_utc_timestamp()
function applies the time zone offset to the timestamp.It expects the time zone in the format 'GMT+0100', which is created using string manipulation. So from_utc_timestamp
function takes the timestamp as being in UTC and converts it to the local time of the timezone specified.
This should correctly add the offset to the time.