SparkR. Convert UTC to local time for all observations, multiple timezones


I am new to SparkR, learning as I go, and on version 2.2.0-SNAPSHOT. I have a SparkDataFrame with one column of UTC timestamps and one column of time zones. For example:

mydates <- c("2012-12-22 14:23:05", "2015-01-19 16:37:42", "2017-03-02 08:22:55")
mytimeZones <- c("America/Chicago", "America/New_York", "Pacific/Galapagos")
dfr <- data.frame(date = mydates, timezone = mytimeZones)

dfs <- SparkR::createDataFrame(dfr)
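
Note that createDataFrame infers date as a string column here. Spark's from_utc_timestamp generally accepts strings via an implicit cast, but the timestamp type can be made explicit in the schema with a cast (a minimal sketch, using SparkR's cast and printSchema):

dfs <- withColumn(dfs, "date", cast(dfs$date, "timestamp"))  # optional explicit cast
printSchema(dfs)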

What I would like to do is add a column of local times for each observation in my dataset. I saw that the from_utc_timestamp() function takes a column of timestamps and a string representing the desired target time zone. My problem is that I have a column of each. Given this, I am wondering what the workflow would need to be, and whether someone with more experience with SparkR might be willing to shed some light on how I can tackle this problem. At the moment, I have tried things like this:

dftest <- withColumn(dfs, "LocalTime",
  SparkR:::foreach(dfs, function(x){ from_utc_timestamp(y = x$date, x = x$timezone) })
)

But with no success. I'm hoping someone out there has tackled this problem in SparkR and could help me. Thanks for your time. Kind Regards, Nate


Solution

  • You can use either selectExpr:

    selectExpr(dfs, "*", "from_utc_timestamp(date, timezone) AS LocalTime")
    

    or expr:

    withColumn(dfs, "LocalTime", expr("from_utc_timestamp(date, timezone)"))
    

    In the simpler case, where the timezone is a constant string, calling from_utc_timestamp directly would do the trick:

    withColumn(dfs, "LocalTime",
      from_utc_timestamp(column("date"), "America/New_York")
    )

    Unfortunately, there is no from_utc_timestamp with the signature (Column, Column) => Column, hence the SQL-ish solution.
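
    For completeness, here is a minimal end-to-end sketch of the expr approach, assuming a Spark 2.x session started with sparkR.session(); collect brings the result back as a local data.frame:

    library(SparkR)
    sparkR.session()

    dfr <- data.frame(
      date     = c("2012-12-22 14:23:05", "2015-01-19 16:37:42", "2017-03-02 08:22:55"),
      timezone = c("America/Chicago", "America/New_York", "Pacific/Galapagos")
    )
    dfs <- createDataFrame(dfr)

    # selectExpr and expr both delegate to the same SQL function,
    # so either form produces identical results.
    localized <- withColumn(dfs, "LocalTime", expr("from_utc_timestamp(date, timezone)"))
    head(collect(localized))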