Search code examples
mysqlhadoopapache-sparkhiveimpala

How can i convert a timestamp to gmt format in hive


I have a timestamp column in my table, i'm deriving a column named dt_skey out of the timestamp column. For clear explanation lets assume that timestamp column name as time_column. This is how the time_column looks like 2017-02-05 03:33:50, dt_skey column looks like this 20170205033350 which is nothing but removing the symbols in between.

My question here : time_column is in est timezone, i want to convert it to the gmt format while i'm deriving dt_skey out of it. The reason why i want to do this is timestamp will be converted to the gmt format when we query it through impala where as dt_skey will not be converted as it is a int datatype. I'm doing the ingestion through hive where the timestamp and dt_skey column will be in sync when we query through hive. For the reporting purpose and users we use impala, so i want to make changes to the dt_skey column so that if user looks through impala both the columns should be in sync.

Below is the sql i'm using to derive dt_skey column out of a timestamp column:

cast(substr(regexp_replace(cast(time_column as string), '-',''),1,8) as int)as dt_skey

the above query will convert this 2017-02-02 07:32:51 into this 20170202.

Please help me to offset the dt_skey to GMT format. I also welcome solutions through spark.


Solution

  • In Spark:

    rdd = spark.sparkContext.parallelize([('2017-02-05 03:33:50',)])
    df = spark.createDataFrame(rdd, ['EST'])
    df = df.withColumn('GMT', f.to_utc_timestamp(df['EST'], 'EST'))
    res = df.withColumn('YouWanna', f.date_format(df['GMT'], 'yyyyMMddHHmmss'))
    res.show(truncate=False)
    
    +-------------------+---------------------+--------------+
    |EST                |GMT                  |YouWanna      |
    +-------------------+---------------------+--------------+
    |2017-02-05 03:33:50|2017-02-05 08:33:50.0|20170205083350|
    +-------------------+---------------------+--------------+
    

    Or in hive:

    select date_format(to_utc_timestamp('2017-02-05 03:33:50','EST'), 'yyyyMMddHHmmss') from dual
    

    Do you mean this ?