Search code examples
apache-sparkpysparkapache-spark-sqltimestamptimestamp-with-timezone

Converting string with timezone to timestamp spark 3.0


I'm using databricks to ingest a csv and have a column that needs casting from a string to a timestamp. The data comes in as a string in this format: 31-MAR-27 10.59.00.000000 PM GMT

The code I'm using is python, and the cluster is running spark 3.0.1. I've used the below code before but not with the AM/PM or timezone and I can't get a format that doesn't return null for every value

df.withColumn('columnName', to_timestamp(col('columnName'),'dd-MMM-yy HH.mm.ss.SSSSSS a zzz').cast(TimestampType()))

I'm using this documentation https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html to help


Solution

  • Since you have am/pm in your timestamp, you should use lower case h for clock-hour-of-am-pm (1-12), but not H which stands for hour-of-day (0-23).

    df2 = df.withColumn(
        'columnName', 
        to_timestamp(col('columnName'), 'dd-MMM-yy hh.mm.ss.SSSSSS a zzz')
    )
    

    Note that there is no need to cast to timestamp type because to_timestamp already returns a timestamp type column.