Tags: date, pyspark, aws-glue, aws-glue-data-catalog

What is the best way to cast or handle the date datatype in PySpark?


Can you please help me cast the datatype below in PySpark in the best possible way? We can't handle this in the dataframe.

Input:

Aug 11, 2020 04:34:54.0 PM

to expected output:

2020-08-11 04:34:54:00 PM

Solution

  • Use the from_unixtime and unix_timestamp functions.

    Example:

    # required imports
    from pyspark.sql.functions import col, from_unixtime, unix_timestamp, to_timestamp

    # sample data in dataframe
    df.show(10, False)
    #+--------------------------+
    #|ts                        |
    #+--------------------------+
    #|Aug 11, 2020 04:34:54.0 PM|
    #+--------------------------+

    # parse the string with unix_timestamp, then reformat with from_unixtime
    df.withColumn("dt", from_unixtime(unix_timestamp(col("ts"), "MMM d, yyyy hh:mm:ss.SSS a"), "yyyy-MM-dd hh:mm:ss.SSS a")).\
        show(10, False)
    #+--------------------------+--------------------------+
    #|ts                        |dt                        |
    #+--------------------------+--------------------------+
    #|Aug 11, 2020 04:34:54.0 PM|2020-08-11 04:34:54.000 PM|
    #+--------------------------+--------------------------+
    

    If you want the new column to be of timestamp type, use Spark's to_timestamp function instead.

    df.withColumn("dt",to_timestamp(col("ts"),"MMM d, yyyy hh:mm:ss.SSS a")).\
    show(10,False)
    #+--------------------------+-------------------+
    #|ts                        |dt                 |
    #+--------------------------+-------------------+
    #|Aug 11, 2020 04:34:54.0 PM|2020-08-11 16:34:54|
    #+--------------------------+-------------------+
    
    df.withColumn("dt",to_timestamp(col("ts"),"MMM d, yyyy hh:mm:ss.SSS a")).printSchema()
    #root
    # |-- ts: string (nullable = true)
    # |-- dt: timestamp (nullable = true)
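
    Note that Spark uses Java date patterns (`MMM`, `hh`, `a`), while plain Python uses strftime codes. As a minimal sketch for checking the conversion logic locally without a Spark session, the same parse-and-reformat round trip can be written with Python's `datetime` (the strftime codes `%b`, `%I`, `%f`, `%p` here are the rough Python equivalents of the Spark pattern above):

    ```python
    from datetime import datetime

    raw = "Aug 11, 2020 04:34:54.0 PM"

    # %b = abbreviated month, %I = 12-hour clock, %f = fractional seconds
    # (1-6 digits, so ".0" parses), %p = AM/PM marker
    parsed = datetime.strptime(raw, "%b %d, %Y %I:%M:%S.%f %p")

    # 12-hour string form, similar to the formatted column in the answer
    twelve_hour = parsed.strftime("%Y-%m-%d %I:%M:%S %p")

    # 24-hour form, matching what to_timestamp stores internally
    twenty_four_hour = parsed.strftime("%Y-%m-%d %H:%M:%S")
    ```

    This is only a local sanity check of the format logic; in the actual job the Spark functions above do the work distributed across the cluster.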