I was converting Unix time to a readable date/time string using PySpark, and found that the conversion sometimes gives incorrect results, specifically for timestamps on 12/31/2017. So I used Python to double check:
from datetime import datetime
import pandas as pd
from pyspark.sql.functions import from_unixtime

ut1 = 1514696352860

# python
print('Use Python:', datetime.fromtimestamp(ut1/1000).strftime('%Y-%m-%d %H:%M:%S'))

# pyspark
time_df = spark.createDataFrame(pd.DataFrame({'unix_time': [ut1]}))
print('Use Pyspark:', time_df.withColumn('time', from_unixtime(time_df.unix_time/1000, 'YYYY-MM-dd HH:mm:ss'))\
      .select('time').collect())
They get the same result:
Use Python: 2017-12-30 23:59:12
Use Pyspark: [Row(time='2017-12-30 23:59:12')]
But if I add, say, 15 minutes, so that it's on 12/31/2017:
# add 15 mins
ut2 = ut1 + 15*60*1000

# python
print('Use Python:', datetime.fromtimestamp(ut2/1000).strftime('%Y-%m-%d %H:%M:%S'))

# pyspark
time_df = spark.createDataFrame(pd.DataFrame({'unix_time': [ut2]}))
print('Use Pyspark:',
      time_df.withColumn('time', from_unixtime(time_df.unix_time/1000, 'YYYY-MM-dd HH:mm:ss'))\
          .select('time').collect())
Then the results differ, and PySpark is incorrect (the year is off by one):
Use Python: 2017-12-31 00:14:12
Use Pyspark: [Row(time='2018-12-31 00:14:12')]
If I add 24 hours so that it's on 1/1/2018, the results are the same again (not going to repeat the code here):
Use Python: 2018-01-01 00:14:12
Use Pyspark: [Row(time='2018-01-01 00:14:12')]
My system default time zone is EST, and I think both Python and PySpark convert to the datetime in the current system time zone. I have also used an Excel formula, ((unixtime/1000/60/60/24)+DATE(1970,1,1)-5/24), and got the same results as Python.
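As a sanity check, here is the same Excel arithmetic reproduced in plain Python (just a sketch, assuming a fixed UTC-5 offset with no DST handling, mirroring the -5/24 term in the formula):

from datetime import datetime, timedelta

ut2 = 1514696352860 + 15*60*1000
# days since the Unix epoch, then shift by the fixed EST offset (UTC-5)
days = ut2 / 1000 / 60 / 60 / 24
est_time = datetime(1970, 1, 1) + timedelta(days=days) - timedelta(hours=5)
print(est_time.strftime('%Y-%m-%d %H:%M:%S'))  # 2017-12-31 00:14:12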
I'd appreciate any insight into why this happens and how to fix it in PySpark. My dataset has almost 500 million rows, so it's ideal to stay in PySpark. I could try using a UDF instead of the built-in from_unixtime function, but any better solutions are welcome!
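For reference, the UDF fallback I have in mind would look roughly like this (just a sketch; it reuses the time_df DataFrame from above, and a Python UDF will be much slower than the built-in functions at this scale):

from datetime import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Sketch of the UDF fallback: do the conversion in Python, row by row
to_local_time = udf(
    lambda ms: datetime.fromtimestamp(ms / 1000).strftime('%Y-%m-%d %H:%M:%S'),
    StringType()
)
print(time_df.withColumn('time', to_local_time(time_df.unix_time)).select('time').collect())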
According to this JIRA ticket:
That is not a bug. You should use 'yyyy' instead of 'YYYY'.

In Java's date format patterns, which from_unixtime uses, 'Y' is the week-based year while 'y' is the calendar year. 12/31/2017 is a Sunday, so under the default locale it falls in the first week of week-year 2018, which is why the 'YYYY' pattern prints 2018 even though the day and time are correct.
print(
'Use Pyspark:',
time_df.withColumn(
'time',
from_unixtime(time_df.unix_time/1000, 'yyyy-MM-dd HH:mm:ss')
).select('time').collect()
)
# Use Pyspark: [Row(time='2017-12-31 00:14:12')]
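As an optional addition (not part of the original answer): if you don't want the output to depend on the cluster's system time zone, you can pin Spark's session time zone explicitly; from_unixtime formats the string in that zone (the config is available since Spark 2.2):

# Make the conversion independent of the JVM/system default time zone
spark.conf.set('spark.sql.session.timeZone', 'America/New_York')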