Tags: scala, apache-spark, apache-spark-sql, timestamp-with-timezone

Spark 2.3 (Scala) - Convert a timestamp column from UTC to timezone specified in another column


I have a data frame with data like this:

    +----------------------+------------+
    | utc_timestamp        | tz_locale  |
    +----------------------+------------+
    |2021-07-16T10:00:00Z  | US/Eastern |
    |2021-07-19T15:00:00Z  | US/Central |
    +----------------------+------------+

I want to convert the timestamps from UTC (TZ 0) to the local time based on the value in the tz_locale column:

    +--------------------------+------------+
    | utc_timestamp            | tz_locale  |
    +--------------------------+------------+
    |2021-07-16T06:00:00-04:00 | US/Eastern |
    |2021-07-19T10:00:00-05:00 | US/Central |
    +--------------------------+------------+

I tried writing it like this:

val new_df = df.withColumn("utc_timestamp", from_utc_timestamp(df.col("utc_timestamp"), df.col("tz_locale")))

It appears from_utc_timestamp takes a String constant as its second argument, so it can only convert the entire column to one fixed timezone. But I need to convert each row dynamically, based on the value of another column in that row.

I think this is possible in newer versions of Spark (from_utc_timestamp is overloaded with a variant that takes (Column, Column)), but I am on 2.3 and upgrading is not an option. How can this be done in Spark 2.3? It seems like a fairly common task, but I can't figure it out and couldn't find anything by searching.


Solution

  • For Spark 2.3 or older, you can take advantage of the less type-constrained SQL form of from_utc_timestamp via expr, which accepts a column for the timezone argument (a fuller, self-contained sketch follows the output below):

    df.withColumn("utc_timestamp", expr("from_utc_timestamp(utc_timestamp, tz_locale)")).show
    
    +-------------------+----------+
    |      utc_timestamp| tz_locale|
    +-------------------+----------+
    |2021-07-15 23:00:00|US/Eastern|
    |2021-07-19 03:00:00|US/Central|
    +-------------------+----------+
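
For completeness, here is a minimal self-contained sketch of the same approach, assuming the data starts out as ISO-8601 strings as in the question (the DataFrame construction, the local_timestamp column name, and the appName are illustrative, not from the original post):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, expr, to_timestamp}

    val spark = SparkSession.builder.appName("tz-convert").getOrCreate()
    import spark.implicits._

    // Sample rows mirroring the question; utc_timestamp begins as a string.
    val df = Seq(
      ("2021-07-16T10:00:00Z", "US/Eastern"),
      ("2021-07-19T15:00:00Z", "US/Central")
    ).toDF("utc_timestamp", "tz_locale")

    val localized = df
      // Parse the ISO-8601 string into a proper TimestampType column.
      .withColumn("utc_timestamp", to_timestamp(col("utc_timestamp")))
      // The SQL form of from_utc_timestamp accepts tz_locale as a per-row column.
      .withColumn("local_timestamp", expr("from_utc_timestamp(utc_timestamp, tz_locale)"))

    localized.show(false)

Note that show renders timestamps in the Spark session timezone (spark.sql.session.timeZone), so the exact wall-clock values you see printed will depend on that setting, even though the shift itself is applied per row from tz_locale.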