Search code examples
sqltimezoneapache-spark-sqldsttimestamp-with-timezone

Converting day lights savings time string to timestamp gives wrong results


I have a pyspark data frame. In this data frame I have a column called test_time which is of string datatype

>>> df
DataFrame[test_time: string]

df.show()

+-------------------+
|          test_time|
+-------------------+
|2017-03-12 02:41:06|
|2017-03-12 02:43:52|
|2017-03-12 02:56:32|
|2017-03-12 03:16:23|
|2017-03-12 03:17:15|
|2017-03-12 03:22:19|
|2017-03-12 03:52:19|
|2017-03-12 04:03:21|
+-------------------+

Now I want to convert this test_time column from string to timestamp

I have done like below

from pyspark.sql import functions as F
df1 = df.withColumn('convert_test', F.unix_timestamp('test_time', "yyyy-MM-dd hh:mm:ss").cast('timestamp'))

>>> df1
DataFrame[test_time: string, convert_test: timestamp]

df1.show()

+-------------------+--------------------+
|          test_time|        convert_test|
+-------------------+--------------------+
|2017-03-12 02:41:06|2017-03-12 03:41:...|
|2017-03-12 02:43:52|2017-03-12 03:43:...|
|2017-03-12 02:56:32|2017-03-12 03:56:...|
|2017-03-12 03:16:23|2017-03-12 03:16:...|
|2017-03-12 03:17:15|2017-03-12 03:17:...|
|2017-03-12 03:22:19|2017-03-12 03:22:...|
|2017-03-12 03:52:19|2017-03-12 03:52:...|
|2017-03-12 04:03:21|2017-03-12 04:03:...|
+-------------------+--------------------+

As you can see the Hours are different for rows 1-3.

FYI my timezone is PST and the rows 1-3 are timings during day light savings time.

How can I get the correct conversion to be done.


Solution

  • I am getting the correct output with unix_timestamp()

      val dataframe = Seq(
        ("2017-03-12 02:41:06"),
        ("2017-03-12 02:43:52"),
        ("2017-03-12 02:56:32"),
        ("2017-03-12 03:16:23"),
        ("2017-03-12 03:17:15"),
        ("2017-03-12 03:22:19"),
        ("2017-03-12 03:52:19"),
        ("2017-03-12 04:03:21")
      ).toDF("test_time")
    
     dataframe.withColumn("convert_test", unix_timestamp($"test_time", "yyyy-MM-dd hh:mm:ss").cast("timestamp")).show()
    

    Output:

    +-------------------+--------------------+
    |          test_time|        convert_test|
    +-------------------+--------------------+
    |2017-03-12 02:41:06|2017-03-12 02:41:...|
    |2017-03-12 02:43:52|2017-03-12 02:43:...|
    |2017-03-12 02:56:32|2017-03-12 02:56:...|
    |2017-03-12 03:16:23|2017-03-12 03:16:...|
    |2017-03-12 03:17:15|2017-03-12 03:17:...|
    |2017-03-12 03:22:19|2017-03-12 03:22:...|
    |2017-03-12 03:52:19|2017-03-12 03:52:...|
    |2017-03-12 04:03:21|2017-03-12 04:03:...|
    +-------------------+--------------------+
    

    If you have in different timezone than you can use function like from_utc_timestamp() and to_utc_timestamp() to convert the timestamp.

    Hope this was helpful!