Search code examples
stringpysparktimestampunix-timestamp

PySpark: casting string as timestamp gives wrong time


I use the following code to cast a string type time timstm_hm to timestamp time timstm_hm_timestamp. Here is the code.

from pyspark.sql.functions import col, unix_timestamp
df = df.withColumn('timstm_hm_timestamp', unix_timestamp(col('timstm_hm'), "yyyy-mm-dd HH:mm").cast("timestamp"))

Here are the results.

-------------------------------------------------
|   timstm_hm         |   timstm_hm_timestamp   |  
-------------------------------------------------
|2018-02-08 11:04     | 2018-01-08 11:04:00     | 
-------------------------------------------------
|2018-02-27 20:34     | 2018-01-27 20:34:00     | 
-------------------------------------------------
|2018-02-23 19:47     | 2018-01-23 19:47:00     | 
-------------------------------------------------

Why there is one month difference between the conversion? It is very weird as it works for month January but not since February.


Solution

  • You just need to replace mm with capitals MM.

    Refer to java date format for more info: Javasimpledate

    from pyspark.sql.functions import col, unix_timestamp
    df.withColumn('timstm_hm_timestamp', unix_timestamp(col('timstm_hm'), "yyyy-MM-dd HH:mm").cast("timestamp")).show()
    
    +----------------+-------------------+
    |       timstm_hm|timstm_hm_timestamp|
    +----------------+-------------------+
    |2018-02-08 11:04|2018-02-08 11:04:00|
    +----------------+-------------------+
    

    Also, you can achieve same output by using just to_timestamp with capital MM.

    from pyspark.sql.functions import to_timestamp
    df.withColumn("timestm_hm_timestamp", to_timestamp("timstm_hm","yyyy-MM-dd HH:mm" )).show()
    
    +----------------+--------------------+
    |       timstm_hm|timestm_hm_timestamp|
    +----------------+--------------------+
    |2018-02-08 11:04| 2018-02-08 11:04:00|
    +----------------+--------------------+