My data is in a PySpark dataframe ('pyspark.sql.dataframe.DataFrame'). One of the columns holds a datetime stored in Twitter's string format.
I found a couple of solutions for Python, but there is no specific solution for PySpark.
This is what the column looks like:
+------------------------------+----+
|created_at(string format) |date|
+------------------------------+----+
|Tue Mar 26 02:29:54 +0000 2019|null|
|Tue Mar 26 02:29:54 +0000 2019|null|
|Tue Mar 26 02:29:54 +0000 2019|null|
|Tue Mar 26 02:29:54 +0000 2019|null|
|Tue Mar 26 02:29:54 +0000 2019|null|
+------------------------------+----+
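For reference, plain Python's strptime does parse this Twitter format directly with the %z directive (a minimal check, independent of Spark):

```python
from datetime import datetime

# Twitter's created_at format: weekday, month, day, time, UTC offset, year
s = "Tue Mar 26 02:29:54 +0000 2019"
dt = datetime.strptime(s, '%a %b %d %H:%M:%S %z %Y')
print(dt)  # 2019-03-26 02:29:54+00:00
```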
I tried the following solution, but it didn't work:
date_df = df.select('created_at', from_unixtime(unix_timestamp('created_at', '%a %b %d %H:%M:%S %z %Y')).alias('date'))
I need to convert the column into a Spark datetime/timestamp type so I can perform other datetime and spark.sql operations on top of it.
I couldn't get any of the solutions provided to work. Apparently the to_timestamp() function in pyspark.sql reads its input in a certain format, and all my attempts to provide that format yielded no results, so using a UDF (user defined function) was inevitable. Note that a UDF cannot return a Python datetime, so it has to be a two-step process: first convert the created_at string into a standard timestamp string with the UDF, then use to_utc_timestamp() to convert the datatype in the Spark dataframe.
from datetime import datetime
import pytz
from pyspark.sql.functions import udf, to_utc_timestamp
from pyspark.sql.types import StringType

## Converting date string format
def getDate(x):
    if x is not None:
        return str(datetime.strptime(x, '%a %b %d %H:%M:%S +0000 %Y').replace(tzinfo=pytz.UTC).strftime("%Y-%m-%d %H:%M:%S"))
    else:
        return None

## UDF declaration
date_fn = udf(getDate, StringType())

## Converting datatype in spark dataframe
df = df.withColumn("created_at", to_utc_timestamp(date_fn("created_at"), "UTC"))
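A quick sanity check of the getDate logic outside Spark (a sketch using the standard library's timezone.utc standing in for pytz.UTC, so it has no third-party dependency):

```python
from datetime import datetime, timezone

def getDate(x):
    # Same logic as the UDF body above, with timezone.utc instead of pytz.UTC
    if x is not None:
        return str(datetime.strptime(x, '%a %b %d %H:%M:%S +0000 %Y')
                   .replace(tzinfo=timezone.utc)
                   .strftime("%Y-%m-%d %H:%M:%S"))
    else:
        return None

print(getDate("Tue Mar 26 02:29:54 +0000 2019"))  # 2019-03-26 02:29:54
print(getDate(None))                              # None
```

The null check matters: Spark will pass None into the UDF for null rows, and strptime would raise a TypeError on None.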