Tags: apache-spark, datetime, pyspark, apache-spark-sql, parquet

Convert FILETIME to local time in PySpark


I have a Parquet file that contains the column shown below:

+--------------------+
|     Start_Time     |
+--------------------+
| 131696565231590000 |
| 131696565251590000 |
| 131726072567140000 |
| 131726073867090000 |
| 131726076586810000 |
| 131726079656940000 |
| 131726081667130000 |
| 131726085077090000 |
| 131726085207190000 |
+--------------------+

I need to read this file and convert this column to a datetime column in local time. The expected output is below (my local time is IST):

+-------------------------+
|       Start_Time        |
+-------------------------+
| 2018-05-01 13:55:23.160 |
| 2018-05-01 13:55:25.160 |
| 2018-06-04 17:34:16.713 |
| 2018-06-04 17:36:26.710 |
| 2018-06-04 17:40:58.680 |
| 2018-06-04 17:46:05.693 |
| 2018-06-04 17:49:26.713 |
| 2018-06-04 17:55:07.710 |
| 2018-06-04 17:55:20.720 |
+-------------------------+

I tried converting the timestamps individually as shown below and adding +5:30 to the result, but is there a better solution I can apply while reading the Parquet file itself, or afterwards?

from datetime import datetime

datetime.fromtimestamp((131696565231590000 - 116444736000000000) // 10000000)
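
For reference, this per-row approach can be made timezone-aware with the standard-library zoneinfo module (Python 3.9+) instead of hand-adding +5:30; the helper name below is just illustrative, and it still runs in plain Python row by row rather than inside Spark:

from datetime import datetime
from zoneinfo import ZoneInfo

# FILETIME counts 100 ns ticks since 1601-01-01; Unix time counts seconds since 1970-01-01.
FILETIME_EPOCH_OFFSET = 116444736000000000  # ticks between the two epochs

def filetime_to_local(ft, tz="Asia/Kolkata"):
    # divide with / (not //) to keep sub-second precision
    unix_seconds = (ft - FILETIME_EPOCH_OFFSET) / 10_000_000
    return datetime.fromtimestamp(unix_seconds, tz=ZoneInfo(tz))

print(filetime_to_local(131696565231590000))  # -> 2018-05-01 19:25:23.159000+05:30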

Solution

  • You can do the conversion entirely in Spark: subtract the FILETIME epoch offset, cast the result to a timestamp (a double is interpreted as seconds since the Unix epoch, in UTC), and apply from_utc_timestamp with the IST timezone:

    df2 = df.selectExpr("""
        from_utc_timestamp(
            timestamp((Start_Time - 116444736000000000) / 10000000), 'IST'
        ) as Start_Time
    """)
    
    df2.show(truncate=False)
    +-----------------------+
    |Start_Time             |
    +-----------------------+
    |2018-05-01 19:25:23.159|
    |2018-05-01 19:25:25.159|
    |2018-06-04 23:04:16.714|
    |2018-06-04 23:06:26.709|
    |2018-06-04 23:10:58.681|
    |2018-06-04 23:16:05.694|
    |2018-06-04 23:19:26.713|
    |2018-06-04 23:25:07.709|
    |2018-06-04 23:25:20.719|
    +-----------------------+
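
    Equivalently, with the DataFrame API (a sketch of the same logic; note that Spark's documentation recommends region-based zone IDs such as 'Asia/Kolkata' over ambiguous three-letter abbreviations like 'IST'):

    from pyspark.sql import functions as F

    df2 = df.select(
        F.from_utc_timestamp(
            # 100 ns FILETIME ticks -> epoch seconds; casting a double to
            # timestamp interprets it as seconds since 1970-01-01 UTC
            ((F.col("Start_Time") - F.lit(116444736000000000)) / 10000000)
                .cast("timestamp"),
            "Asia/Kolkata",
        ).alias("Start_Time")
    )

    df2.show(truncate=False) then prints the same table as above.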