Tags: twitter, pyspark, apache-spark-sql

How to parse twitter date time string in pyspark?


My data is in a pyspark dataframe ('pyspark.sql.dataframe.DataFrame'). One of the columns holds a date and time in Twitter's string format.

I found a couple of solutions for python but there is no specific solution for pyspark.

This is what the column looks like:

+------------------------------+----+
|created_at(string format)     |date|
+------------------------------+----+
|Tue Mar 26 02:29:54 +0000 2019|null|
|Tue Mar 26 02:29:54 +0000 2019|null|
|Tue Mar 26 02:29:54 +0000 2019|null|
|Tue Mar 26 02:29:54 +0000 2019|null|
|Tue Mar 26 02:29:54 +0000 2019|null|
+------------------------------+----+

I tried the following solution, but it didn't work:

date_df = df.select('created_at', from_unixtime(unix_timestamp('created_at', '%a %b %d %H:%M:%S %z %Y')).alias('date'))

I need to convert the column into a spark datetime/timestamp type so I can perform other datetime and spark.sql operations on top of it.


Solution

  • I couldn't get any of the solutions provided working. Apparently the to_timestamp() function in pyspark.sql expects its input in a certain format, and all attempts to supply that format yielded no results. Hence using a UDF (user-defined function) was inevitable. Note, however, that a UDF cannot return a Python datetime, so it has to be a two-step process:

    1. UDF for parsing the date string, convert the format and return a string compatible with to_timestamp()
    2. Using to_timestamp() to convert the datatype in spark dataframe
    from datetime import datetime
    import pytz
    from pyspark.sql.functions import udf, to_utc_timestamp
    from pyspark.sql.types import StringType
    
    ## Convert the Twitter date string into a format to_timestamp() understands
    def getDate(x):
        if x is not None:
            return str(datetime.strptime(x, '%a %b %d %H:%M:%S +0000 %Y').replace(tzinfo=pytz.UTC).strftime("%Y-%m-%d %H:%M:%S"))
        else:
            return None
    
    ## UDF declaration
    date_fn = udf(getDate, StringType())
    
    ## Converting the datatype in the spark dataframe
    df = df.withColumn("created_at", to_utc_timestamp(date_fn("created_at"), "UTC"))
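    The parsing step (step 1) is plain Python, so it can be sanity-checked without a Spark session. Here is a minimal sketch of the same conversion using only the standard library's timezone.utc in place of pytz (the function name get_date is just an illustrative stand-in for the UDF body above):
    
    ```python
    from datetime import datetime, timezone
    
    def get_date(x):
        # Parse Twitter's created_at string and reformat it into
        # the "yyyy-MM-dd HH:mm:ss" shape that to_timestamp() accepts.
        if x is None:
            return None
        parsed = datetime.strptime(x, '%a %b %d %H:%M:%S +0000 %Y').replace(tzinfo=timezone.utc)
        return parsed.strftime('%Y-%m-%d %H:%M:%S')
    
    print(get_date('Tue Mar 26 02:29:54 +0000 2019'))  # 2019-03-26 02:29:54
    print(get_date(None))  # None
    ```
    
    Testing the conversion locally like this makes it easier to rule out the UDF body as the source of any remaining nulls in the dataframe.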