Tags: scala, apache-spark, parsing, timestamp, spark3

Scala: Parse timestamp using Spark 3.1.2


I have an Excel reader that puts its results into Spark DataFrames. I am having problems parsing the timestamps.

I have timestamps as strings like Wed Dec 08 20:49:59 CET 2021. I was using spark-sql version 2.4.5 and everything worked fine until I recently updated to version 3.1.2.

Please find some minimal code below.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_timestamp}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("timestamp-parsing")
  .getOrCreate()
import spark.implicits._ // required for Seq(...).toDF

val ts: String = "Wed Dec 08 20:49:59 CET 2021"
val oldfmt: String = "E MMM dd HH:mm:ss z yyyy"

val ttdf = Seq(ts)
  .toDF("theTimestampColumn")
  .withColumn("parsedTime", to_timestamp(col("theTimestampColumn"), oldfmt))

ttdf.show()

Running this code with Spark version 2.4.5 works as expected and produces the following output:

+--------------------+-------------------+
|  theTimestampColumn|         parsedTime|
+--------------------+-------------------+
|Wed Dec 08 20:49:...|2021-12-08 20:49:59|
+--------------------+-------------------+

Now, executing the same code, just with Spark version 3.1.2, results in the following error:

Exception in thread "main" org.apache.spark.SparkUpgradeException: 
You may get a different result due to the upgrading of Spark 3.0: 
Fail to recognize 'E MMM dd HH:mm:ss z yyyy' pattern in the DateTimeFormatter. 
1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 
2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html


This website doesn't help me any further. I can't find any mistakes in my format string. The symbol E represents the day-of-week as text, like Tue; Tuesday. The symbol M represents the month-of-year, like 7; 07; Jul; July. The symbols H, m, s, and y represent hours, minutes, seconds, and years, respectively. The symbol z denotes the time-zone name, like Pacific Standard Time; PST. Am I missing something obvious here?
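
For what it's worth, the same pattern parses fine outside of Spark with the legacy java.text.SimpleDateFormat, which is, as far as I know, what Spark used for parsing before version 3.0. A minimal sketch:

import java.text.SimpleDateFormat
import java.util.Locale

val legacyParser = new SimpleDateFormat("E MMM dd HH:mm:ss z yyyy", Locale.ENGLISH)
// prints the parsed java.util.Date, rendered in the JVM's default time zone
println(legacyParser.parse("Wed Dec 08 20:49:59 CET 2021"))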

Any help will be really appreciated. Thank you in advance.


Solution

  • You can use E only for datetime formatting, not for parsing, as stated in the datetime pattern documentation:

    Symbols of ‘E’, ‘F’, ‘q’ and ‘Q’ can only be used for datetime formatting, e.g. date_format. They are not allowed used for datetime parsing, e.g. to_timestamp.
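
    To illustrate the difference, here is a small sketch (assuming a SparkSession named spark is in scope): formatting with E works in Spark 3, while parsing with the same symbol is exactly what raises the SparkUpgradeException above.

    import org.apache.spark.sql.functions.{current_timestamp, date_format}

    // formatting with 'E' is allowed ...
    spark.range(1)
      .select(date_format(current_timestamp(), "E MMM dd HH:mm:ss yyyy"))
      .show(false)

    // ... but parsing a pattern containing 'E' with to_timestamp fails
    // with SparkUpgradeException (unless the LEGACY policy below is set)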

    If you want to restore the behavior of Spark versions before 3.0, you can set the spark.sql.legacy.timeParserPolicy option to LEGACY:

    sparkSession.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
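
    The same option can also be set once when building the session rather than at runtime; a sketch, assuming you construct the SparkSession yourself:

    import org.apache.spark.sql.SparkSession

    val sparkSession = SparkSession.builder()
      .master("local[*]")
      .appName("legacy-time-parser")
      // restore the pre-3.0 (SimpleDateFormat-based) parsing behavior
      .config("spark.sql.legacy.timeParserPolicy", "LEGACY")
      .getOrCreate()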
    

    And if you don't want to change the Spark configuration, you can remove the leading day-of-week characters with the substr SQL function:

    import org.apache.spark.sql.functions.{col, expr, to_timestamp}

    val ts: String = "Wed Dec 08 20:49:59 CET 2021"
    val fmt: String = "MMM dd HH:mm:ss z yyyy"

    val ttdf = Seq(ts)
      .toDF("theTimestampColumn")
      // substr is 1-based: starting at position 5 drops the "Wed " day-of-week prefix
      .withColumn("preparedTimestamp", expr("substr(theTimestampColumn, 5, length(theTimestampColumn))"))
      .withColumn("parsedTime", to_timestamp(col("preparedTimestamp"), fmt))
      .drop("preparedTimestamp")